15 billion row text file and row deletes - Best Practice?

by awohld (Hermit)
on Dec 01, 2006 at 05:03 UTC

awohld has asked for the wisdom of the Perl Monks concerning the following question:

I have a 15 billion row text file that is something like this:

    serial,name,yes_no
    00020123837,,y
I have another file that has a list of serials to delete. When I find one of those serials in the 15 billion row file, I'll delete that serial's row from the 15 billion row text file.

Without using a DB, what's the best way to search this 15 billion line file for specific serials and when found, delete that line?

The total size of this file is approx. 380 GB and I have 2 GB of RAM. If this takes a week to finish, it's not a problem.

Replies are listed 'Best First'.
Re: 15 billion row text file and row deletes - Best Practice?
by bobf (Monsignor) on Dec 01, 2006 at 05:14 UTC

    How many serials are in the "file that has a list of serials to delete"?

    If it is a relatively small number, you could read them into a hash. Then you could read through the 15 billion line file line-by-line (thereby avoiding the need to keep the whole thing in memory at once); if the line's serial is in the "delete" hash then read the next line, otherwise print it to a new output file.

    You'll need enough drive space to accommodate the new output file, of course, but it would accomplish the goal without using a db and with only minimal memory requirements, and it would still require only a single pass through the 15 billion line file.
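
    A minimal sketch of that approach might look like this (the file names are made up; it assumes the delete list has one serial per line and that the serial is the first comma-separated field of each row):

        #!/usr/bin/perl
        use strict;
        use warnings;

        # Load the (relatively small) delete list into a hash; only the keys matter.
        open my $del, '<', 'deletes.txt' or die "Can't open deletes.txt: $!";
        my %kill;
        while (<$del>) {
            chomp;
            $kill{$_} = undef;
        }
        close $del;

        # Stream the big file line by line, copying only the rows to keep.
        open my $in,  '<', 'bigfile.txt'     or die "Can't open bigfile.txt: $!";
        open my $out, '>', 'bigfile.new.txt' or die "Can't open bigfile.new.txt: $!";
        while (my $line = <$in>) {
            my ($serial) = split /,/, $line;
            print $out $line unless exists $kill{$serial};
        }
        close $in;
        close $out or die "Error writing bigfile.new.txt: $!";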

      If it is a relatively small number, you could read them into a hash.

      If it's a big number, you could read them into a disk-based hash like BerkeleyDB. It's a lot slower than an in-memory hash, of course, but it would make the code pretty easy to write.
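
      For instance, the DB_File module (a Berkeley DB binding that ships with many Perl builds) can tie a hash to disk; a rough sketch, with made-up file names:

        use strict;
        use warnings;
        use Fcntl;
        use DB_File;

        # Tie the kill list to an on-disk Berkeley DB hash so it
        # doesn't have to fit in the 2GB of RAM.
        tie my %kill, 'DB_File', 'kill.db', O_CREAT | O_RDWR, 0644, $DB_HASH
            or die "Can't tie kill.db: $!";

        open my $del, '<', 'deletes.txt' or die "Can't open deletes.txt: $!";
        while (<$del>) {
            chomp;
            $kill{$_} = 1;
        }
        close $del;

        # %kill can now be probed exactly like an in-memory hash, e.g.
        #   print $out $line unless exists $kill{$serial};
        untie %kill;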

      If it was me, though, I'd probably use a database.

        That was my first thought, too (well, almost - I thought of DBM::Deep because I played with that one before).

        I would have suggested a db if the OP didn't clearly state "Without using a DB..." :-) It also depends on how this data is being used/processed, though. If this is a one-time filtering step then loading it all into a db, filtering, and exporting again could be very inefficient. OTOH, if this is just one of many steps that require searching though the data, a db could be better.

        The 15 billion serials are unique. I'm just nervous about using one since I don't know how big the overhead will be. I have 680GB on disk free so maybe I should use a DB.
Re: 15 billion row text file and row deletes - Best Practice?
by imp (Priest) on Dec 01, 2006 at 05:46 UTC
    If you have to work with the text file then I would recommend using sed instead of perl.
    # If your version of sed supports editing in place
    sed -i -e '/^00020123837,/d' somefile.txt

    # Otherwise
    sed -e '/^00020123837,/d' somefile.txt > tmp.txt
    mv tmp.txt somefile.txt
    If this is done regularly or other maintenance work is going to be done then a database becomes a much more attractive option.
      And bear in mind that sed supports the use of an "edit script" file -- one could take a list of patterns that should be deleted from a file, and turn that into an edit script. Based on the OP's description, the list of serial numbers to kill could be saved as:
      /0001234/d
      /0004567/d
      /0089123/d
      ...
      If that's in a file called "kill.list", then just run sed like this:
      sed -f kill.list big.file > tmp.copy
      mv tmp.copy big.file
      On a file of a few hundred GB, I agree that using sed for this very simple sort of editing would be a significant win (save a lot of run time).
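
      Generating that edit script from the delete file could itself be a one-liner; a sketch, assuming a made-up deletes.txt with one serial per line, and anchoring on the leading field so a serial can't accidentally match inside a name:

        perl -ne 'chomp; print "/^$_,/d\n"' deletes.txt > kill.list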
Re: 15 billion row text file and row deletes - Best Practice?
by grinder (Bishop) on Dec 01, 2006 at 10:04 UTC

    One alternative way to look at the problem requires co-operation from the downstream processes, and avoids having to create a copy of the file.

    The idea is to open the file for read/write, and instead of deleting the record (which really means "not writing it to the new file") you seek to the beginning of the record and replace the serial number with dashes or tildes, or whatever it takes for a downstream process to ignore it.

    This is akin to how DOS deleted files: it simply replaced the first character of the filename by a twiddly character, and the directory read-first/read-next system calls knew to ignore them when asked to return the files in a directory.
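
    A minimal sketch of that in-place marking, assuming the kill list fits in a hash and using made-up names (any downstream reader would have to be taught to skip rows whose serial has been blotted out):

        use strict;
        use warnings;

        # Sketch only: assume %kill holds the serials to be "deleted".
        my %kill = map { $_ => 1 } qw( 00020123837 );

        open my $fh, '+<', 'bigfile.txt' or die "Can't open bigfile.txt: $!";
        my $start = tell $fh;
        while ( my $line = <$fh> ) {
            my( $serial ) = split /,/, $line;
            if ( exists $kill{ $serial } ) {
                my $next = tell $fh;               # where the following record starts
                seek $fh, $start, 0;               # back to the start of this record
                print $fh '~' x length $serial;    # blot out the serial, same byte count
                seek $fh, $next, 0;                # a seek is required before reading again
            }
            $start = tell $fh;
        }
        close $fh or die "Close failed: $!";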

    If you have relatively few serial numbers to delete you can store them in a hash (to the extent that "relatively few" fits comfortably in 2GB of space).

    If you have more than that, you may find that assembling them all into a regexp (with Regexp::Assemble) yields a tractable pattern. The economy of sharing the common prefixes of many serial numbers means that the pattern won't be as big as you'd think. Then you test whether the line matches the regexp, rather than whether the field exists in the hash.
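
    A rough sketch of the Regexp::Assemble variant (a CPAN module; deletes.txt is a made-up name for the kill file, and the big file is filtered from STDIN to STDOUT here):

        use strict;
        use warnings;
        use Regexp::Assemble;

        # Build one pattern out of all the serials to delete.
        my $ra = Regexp::Assemble->new;
        open my $del, '<', 'deletes.txt' or die "Can't open deletes.txt: $!";
        while ( <$del> ) {
            chomp;
            $ra->add( $_ );
        }
        close $del;
        my $re = $ra->re;    # a single, prefix-sharing regexp

        # Keep only the lines whose leading serial field doesn't match.
        while ( my $line = <STDIN> ) {
            print $line unless $line =~ /^$re,/;
        }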

    • another intruder with the mooring in the heart of the Perl

      This is somewhat similar to what I would suggest, except that instead of marking certain records as deleted, I'd keep two pointers into the file, one for reading and one for writing, then go through the entire file line by line, reading and writing, but simply not writing when I find one of the unwanted serial numbers. This approach involves reading the whole file (but only a line at a time, so it won't need much memory) and writing the whole file, so it will take about as much time as writing the wanted lines to a new file and then renaming it, but it won't require 380GB of intermediate storage.

      Don't forget to truncate() the file when you've finished writing the last record, or you'll end up with rubbish tacked onto the end.
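
      A bare-bones sketch of that single-file, two-handle pass (the kill list and file name are made up; note the truncate at the end):

        use strict;
        use warnings;

        my %kill = map { $_ => 1 } qw( 00020123837 );   # the kill list, loaded however you like

        # Two handles on the same file: one reads ahead, one writes behind.
        open my $read,  '<',  'bigfile.txt' or die "Can't open for reading: $!";
        open my $write, '+<', 'bigfile.txt' or die "Can't open for writing: $!";

        while ( my $line = <$read> ) {
            my( $serial ) = split /,/, $line;
            next if exists $kill{ $serial };   # "delete" by simply not writing it back
            print {$write} $line;
        }

        # Remember where the compacted data ends, then chop off the leftover tail.
        my $new_length = tell $write;
        close $write or die "Close failed: $!";
        close $read;
        truncate( 'bigfile.txt', $new_length ) or die "truncate failed: $!";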

Re: 15 billion row text file and row deletes - Best Practice?
by bart (Canon) on Dec 01, 2006 at 11:29 UTC
    I'd think of something structured like this:
    my %kill;
    @kill{'00020123837', '00020123839'} = ();

    while (<>) {
        my($serial) = split /,/;
        print unless exists $kill{$serial};
    }
    This way, you make a copy of the text file (to STDOUT), and only print out those lines you want to keep.

    It'll only go through the file once, for the whole job, but you will need your extra disk space, as the copy will be almost as large as the original.

    You will have to fill the hash with serials to kill, somehow — not necessarily the way I show here. You might read them from a database or other source file. The hash values are not important (they're undef here), only the keys matter.

    If you want to make this a one-liner, think of using the switches -n (to loop through the input file without printing) and -i (to replace the original file with the output file when finished). Something like (using Unix quotes for the command line):

    perl -n -i.orig -e 'BEGIN{@kill{"00020123837","00020123839"}=()} my($serial)=split/,/; print unless exists $kill{$serial}' myhugefile.txt

    Swap the single and double quotes when running this on Windows.

Re: 15 billion row text file and row deletes - Best Practice?
by reasonablekeith (Deacon) on Dec 01, 2006 at 13:39 UTC
    You don't mention whether either of the files is sorted. If they are both sorted by serial number, then it doesn't matter how big either file is, just that you've got enough room for the output file.

    A possible algorithm would be to read one serial number to remove from the exclude file.

    You then go through the main file line by line, copying each line to the output file until you've either found the serial number you want to remove, or have gone past the serial number you picked from the exclude file.

    If it is one to exclude you just don't write it, pick the next serial number from the exclude file and carry on.

    If you've gone past it, you just read from the exclude file until you've got a serial number higher than the one you've just picked from the main file. Then, as before, just carry on.

    This algorithm is single pass, and only has a memory overhead of two lines of text.

    Update: I mean something like this.

    #!/usr/bin/perl
    use warnings;
    use strict;

    open DATA, "data.txt" or die("Can't open file for reading: $!");
    open EXCLUDE, "exclude.txt" or die("Can't open file for reading: $!");
    open OUT, ">out.txt" or die("Can't open file for writing: $!");

    my $exclude_serial = <EXCLUDE>;
    chomp($exclude_serial);

    my $line_data = <DATA>;
    while ($line_data) {
        chomp($line_data);
        my ($serial, $name, $flag) = split /,/, $line_data;

        # if we've run out of numbers to exclude just print the line to the outfile
        if (! defined $exclude_serial) {
            print OUT "$line_data\n";

        # if we've not yet reached the serial to exclude, again just print the line
        } elsif ($serial < $exclude_serial) {
            print OUT "$line_data\n";

        # we must need a new exclude number then, pull it off the file, keeping track
        # of whether the current or subsequently read exclude serials mean we shouldn't
        # print the current line
        } else {
            my $write_current_line = 1; # assume it's okay unless we find a match
            do {
                $write_current_line = 0 if $exclude_serial == $serial;
                $exclude_serial = <EXCLUDE>;
                chomp($exclude_serial) if defined $exclude_serial;
            } until (! defined $exclude_serial or ($exclude_serial > $serial) );
            print OUT "$line_data\n" if $write_current_line;
        }
        $line_data = <DATA>;
    }
    ---
    my name's not Keith, and I'm not reasonable.
      Keith has the best solution here. It's an old-fashioned match-merge. It's also the fastest: guaranteed O(n) (after the sort, of course).
Re: 15 billion row text file and row deletes - Best Practice?
by BrowserUk (Patriarch) on Dec 01, 2006 at 17:28 UTC

    Unless you have a second disk in which to write the output file, I wouldn't use two files.

    Besides the huge intermediate storage requirement, which can be a problem for smaller systems, writing to a disk (and having the OS locate space to write on) at the same time as you are reading from that disk can severely slow down the process. Especially if the disk is more than, say, 50% full and/or fragmented.

    I'd use a two pass process.

    1. Pass 1 reads the file and records the start and end positions of each record to be deleted.

      If the kill list is small, load it all in a hash.

      If it's too big for memory, load the kill list into a hash in chunks and write the positions to another file.

      The second pass only needs the positions one at a time (once they are sorted).

    2. The second pass sorts the positions and then opens two filehandles to the data file: one for reading, one for writing. It then reads via the former, writing to the latter, overwriting the bits that need deleting, and finally truncates the file (via the write handle) to its final length.

    This is a little tricky to envisage, so a diagram might help.

    The records to be deleted: A,J,O,R,Y,Z

    The initial state of the bigfile:    L,J,E,M,X,T,A,Z,G,U,W,Y,Q,R,C,K,O,P,V,I,D,H,N,S,F,B

    The big file without the kill list:  L,  E,M,X,T,    G,U,W,  Q,  C,K,  P,V,I,D,H,N,S,F,B

    The desired result after 'compaction': L,E,M,X,T,G,U,W,Q,C,K,P,V,I,D,H,N,S,F,B

    The following code uses a ramfile with single-character records and ',' as the record separator to demonstrate the logic required:

    #! perl -slw
    use strict;
    use Data::Dumper;
    use List::Util qw[ shuffle ];

    ## Dummy up a hash of lines to kill.
    my %kills = map{ $_ => undef } ( shuffle( 'A' .. 'Z' ) )[ 0 .. 5 ];
    print sort keys %kills;

    ## And a 'big file'
    my $bigfile = join ',', shuffle 'A' .. 'Z';
    print $bigfile;

    ## Scan the bigfile recording the file positions
    ## where records are to be deleted.
    open my $fhRead, '<', \$bigfile;
    {
        local $/ = ',';
        my $lastPos = 0;
        while( <$fhRead> ) {
            chomp;
            ## Store the ranges (start & end) of each record to delete
            $kills{ $_ } = [ $lastPos, tell $fhRead ] if exists $kills{ $_ };
            $lastPos = tell $fhRead;
        }
    }

    ## Sort the ranges into ascending order
    my @posns = sort{ $a->[ 0 ] <=> $b->[ 0 ] } values %kills;

    ## Open a second write handle to the file
    open my $fhWrite, '+<', \$bigfile;
    {
        local $/ = ',';
        local $\;

        ## Move the file pointers for reading and writing
        ## to the end and start positions respectively
        my( $w1, $r1 ) = @{ shift @posns };
        seek $fhWrite, $w1, 0;
        seek $fhRead,  $r1, 0;

        while( @posns ) {
            ## Get the next set of positions
            my( $w2, $r2 ) = @{ shift @posns };

            ## 'Move' the records up to the start of the next record to be deleted
            print $fhWrite scalar <$fhRead> while tell( $fhRead ) < $w2;

            ## Advance the read head over the section to be removed.
            seek $fhRead, $r2, 0;
        }

        ## copy the residual remaining over
        print $fhWrite $_ while <$fhRead>;

        ## truncate the write filehandle.
        # truncate $fhWrite, tell $fhWrite; ## truncate doesn't work on ramfiles!
        $bigfile = substr $bigfile, 0, tell $fhWrite;
    }
    close $fhRead;
    close $fhWrite;

    print $bigfile;

    __END__
    C:\test>junk
    HMNPRS
    L,R,J,Z,P,H,M,K,T,A,D,Q,B,Y,S,E,W,F,U,C,I,X,G,O,N,V
    L,J,Z,K,T,A,D,Q,B,Y,E,W,F,U,C,I,X,G,O,V

    C:\test>junk
    ABGKOT
    G,O,Q,C,U,H,P,I,R,M,S,J,V,W,X,Y,Z,D,E,A,B,L,N,T,K,F
    Q,C,U,H,P,I,R,M,S,J,V,W,X,Y,Z,D,E,L,N,F

    C:\test>junk
    AKLPTZ
    J,F,Q,S,E,X,I,B,U,K,R,A,M,W,Z,G,D,L,H,C,N,Y,O,V,T,P
    J,F,Q,S,E,X,I,B,U,R,M,W,G,D,H,C,N,Y,O,V,

    C:\test>junk
    AJORYZ
    L,J,E,M,X,T,A,Z,G,U,W,Y,Q,R,C,K,O,P,V,I,D,H,N,S,F,B
    L,E,M,X,T,G,U,W,Q,C,K,P,V,I,D,H,N,S,F,B

    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    Lingua non convalesco, consenesco et abolesco. -- Rule 1 has a caveat! -- Who broke the cabal?
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.

      Would it be faster to do relative seeks? I'm just thinking that toward the end of the file you're seeking across 300+ GB of data every time you delete a single line. Would
      seek $fhRead, $r2 - $w2, 1;
      work?

      Just a thought...

      Edit: While it's obvious, I think it's also worth noting that if each record is the same size (which seems likely) then it can be done in one pass instead of two (or at least, one pass for each chunk of the kill file, depending on kill file size).
        Would it be faster to do relative seeks?

        It would, but I think almost all modern implementations of seek would automatically translate an absolute seek into a relative seek anyway. I don't know for sure, but I find it hard to believe that an absolute seek would ever cause the OS to 'go back to the beginning of the file'.

        In reality, seeking simply adjusts a number somewhere, and it's not until the next read or write that it has any effect at all.

        The third parameter to seek is simply a throwback to when files stored on tape were commonplace.

        With regard to the possibility that the file has fixed length records, you're right it would be much easier and faster if that were the case. But from the (scant) information provided by the OP, it seems unlikely. Whilst he shows the serial number padded with leading zeros, he also shows a name field, which is empty in the given example, but names vary in length, so unless they are all empty, it probably indicates that these are variable length records.


        Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
        Lingua non convalesco, consenesco et abolesco. -- Rule 1 has a caveat! -- Who broke the cabal?
        "Science is about questioning the status quo. Questioning authority".
        In the absence of evidence, opinion is indistinguishable from prejudice.
Re: 15 billion row text file and row deletes - Best Practice?
by jbert (Priest) on Dec 01, 2006 at 12:22 UTC
    You might also find that your handling of the file speeds up if you gzip it. If you're I/O limited, you'll have less I/O to do to read through the file.

    Of course, you'll need to make a pass through in order to gzip it, but hey, you'll save next time.

    In fact, to do anything with this file (other than append) you'll need to pass through pretty much the entire file. I don't know what you use it for (can you tell us?) but almost every operation on this file (apart from 'append new item') is going to be faster in a database with an index.
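
    If one did go that way, the filtering pass could read and write gzipped copies on the fly through pipes, so only compressed data ever sits on disk. A sketch, with made-up file names and assuming a gzip binary on the PATH (%kill is the usual hash of serials to drop):

        use strict;
        use warnings;

        my %kill = map { $_ => 1 } qw( 00020123837 );   # the kill list, loaded however you like

        # Decompress on the fly for reading, recompress on the fly for writing.
        open my $in,  '-|', 'gzip -dc bigfile.txt.gz'    or die "Can't start gzip for reading: $!";
        open my $out, '|-', 'gzip -c > filtered.txt.gz'  or die "Can't start gzip for writing: $!";

        while ( my $line = <$in> ) {
            my( $serial ) = split /,/, $line;
            print $out $line unless exists $kill{ $serial };
        }
        close $in;
        close $out or die "gzip write failed: $!";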

Re: 15 billion row text file and row deletes - Best Practice?
by talexb (Chancellor) on Dec 01, 2006 at 14:47 UTC

    Another strategy that no one's mentioned is to split the incoming file into smaller chunks and work on each chunk.

    But it looks like you have plenty of approaches from which to choose -- although it would have been nice to know how large your kill file was, to put the problem into better perspective.

    Alex / talexb / Toronto

    "Groklaw is the open-source mentality applied to legal research" ~ Linus Torvalds

      Alex has a good point: you could put together a script to grab chunks of the big original file (say, 1 million line chunks) and write each chunk to a temp file. Then follow the method where you read the deletes into a hash and parse through the temp file, appending the good lines to the final file. This way, you'll have four files at any one time: the original file, the delete file, the chunk temp file, and the final file. Then again, you'll be butting up against your disk space limit...
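
      A rough sketch of that chunked pass (the chunk size and file names are made up, and %kill is assumed to already hold the serials from the delete file):

        use strict;
        use warnings;

        my %kill;                      # assumed loaded from the delete file
        my $chunk_lines = 1_000_000;   # made-up chunk size

        open my $in,    '<',  'bigfile.txt' or die "Can't open bigfile.txt: $!";
        open my $final, '>>', 'final.txt'   or die "Can't open final.txt: $!";

        while ( ! eof $in ) {
            # Copy the next chunk of lines out to a temp file.
            open my $tmp, '>', 'chunk.tmp' or die "Can't write chunk.tmp: $!";
            for ( 1 .. $chunk_lines ) {
                last if eof $in;
                print $tmp scalar <$in>;
            }
            close $tmp;

            # Filter the chunk, appending the good lines to the final file.
            open my $chunk, '<', 'chunk.tmp' or die "Can't read chunk.tmp: $!";
            while ( my $line = <$chunk> ) {
                my( $serial ) = split /,/, $line;
                print $final $line unless exists $kill{ $serial };
            }
            close $chunk;
        }
        unlink 'chunk.tmp';
        close $final or die "Error writing final.txt: $!";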

      Anyway, just a thought.

      tubaandy
Re: 15 billion row text file and row deletes - Best Practice?
by bsdz (Friar) on Dec 01, 2006 at 16:14 UTC
    How would plain old grep compare? Give grep your serial file with the "-f" option (inverted with -v, so that matching rows are dropped) and pipe the output to a new file. You need to make sure your resultant file is 300GB or less due to your constraints.

    If you assume the name field is only 15 characters on average then a rough calculation is as follows. (Based on plain old ASCII).

    You need to lose at least 80GB from the unfiltered file. Your average record length is: -

       11  (serial) 
     + 1   (comma)
     + 15  (name)
     + 1   (comma)
     + 1   (y|n)
     + 1   (\n)
     = 30. (total bytes)
    
    Now 80GB = 85899345920 bytes, and 80GB/30B (approx) = 2,863,311,531. That's how many records would need to be in your kill list. I doubt grep could handle that.

    Interestingly, 380GB = 408021893120 bytes, which implies 380GB/30B (approx) = 13,600,729,771 records. There aren't even that many people in the world! So either you've left some fields out, my average name length assumption is wrong, or each character in your file is more than a byte wide!

    PS: I am sure that there are plenty of holes in my calculations :)

    Update: Of course you said there were 15 billion rows!

Re: 15 billion row text file and row deletes - Best Practice?
by OfficeLinebacker (Chaplain) on Dec 01, 2006 at 15:38 UTC
    Greetings, esteemed monks!

    Interesting problem. Several very good and interesting suggestions. The only thing I would add is that if the kill list is big, it might make the process faster if, as you find and delete the dead serial numbers from the big file, you delete them from the "kill list" also. This might make tests of subsequent lines faster. I believe that reasonablekeith's suggestion is similar, but it's contingent on the files being sorted.

    I am not sure how well this would work; would he spend more time adding the code to delete the hash key (assuming he goes that route) than he would save? It would just be one extra line in the loop, and only executed if a deletion happened.

    I would NOT use a regex with more than a handful of alternations--now THAT I am pretty sure would be significantly slower than a simple hash key lookup.

    It's possible that the time spent sorting both lists beforehand would be less than the time saved by having them sorted (i.e. sorting would be a good thing). A possible additional benefit of sorting is that you could use an array to store the kill list (as opposed to a hash) and just increment the array index whenever you delete the currently indexed serial number, or (for a more robust approach, if you might have numbers in the kill list that aren't in the big file) whenever the serial number read from the file is greater than the currently indexed serial number to kill.
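
    A sketch of that array-and-index idea, assuming both the kill list and the big file are sorted by serial (the file names are made up):

        use strict;
        use warnings;

        # Sorted kill list; in practice this would be read from the delete file.
        my @kill = sort { $a <=> $b } qw( 00020123837 00020123839 );
        my $i = 0;

        open my $in,  '<', 'bigfile.txt'  or die "Can't open bigfile.txt: $!";
        open my $out, '>', 'filtered.txt' or die "Can't open filtered.txt: $!";

        while ( my $line = <$in> ) {
            my( $serial ) = split /,/, $line;

            # Skip past any kill entries we have already gone beyond.
            $i++ while $i < @kill and $kill[$i] < $serial;

            if ( $i < @kill and $kill[$i] == $serial ) {
                $i++;        # this row is deleted; move on to the next kill entry
                next;
            }
            print $out $line;
        }
        close $in;
        close $out or die "Error writing filtered.txt: $!";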

    Also, if we're talking about spending time preparing the data to make the actual update process faster, the gzip idea might be of benefit, but I am less sure of that, especially if the big file is read in one line at a time.

    _________________________________________________________________________________

    I like computer programming because it's like Legos for the mind.

      Sorting 15 billion rows of text will be non-trivial in terms of time and/or memory ;). I would drop the deleted SNs as you find them, though.

      --Brig

        Brig,

        You're right. gzipping would be faster than sorting, though, right? Worth it?

        The thought of using DBD::CSV also crossed my mind. That would be using a DB interface to a CSV file. Don't know if that meets the OP's requirement not to use a DB.

        Great topic. ++

        _________________________________________________________________________________

        I like computer programming because it's like Legos for the mind.

Re: 15 billion row text file and row deletes - Best Practice?
by kabeldag (Hermit) on Dec 01, 2006 at 14:43 UTC
    While you have received some good suggestions, why the hell would you have a data file of that size?
    That's quite insane ... ( : - S )
Re: 15 billion row text file and row deletes - Best Practice?
by Anonymous Monk on Dec 01, 2006 at 23:32 UTC

    I would ask what other tasks you intend to use this file for? Is this a one time delete?

    I would hazard a guess that this file has a purpose outside of storing a bunch of serials that are only to be deleted from ;-)! Will someone be adding to this list, making changes, looking up entries in it? On a regular basis?

    I would tend to lean toward a DB for anything like this (though not for everything).

    May I suggest that before you take any further action (unless the need is immediate), you analyze what else you may do with this data. You might find the best answer for this problem in the general use of the data itself.

    -digiryde

Re: 15 billion row text file and row deletes - Best Practice?
by sgt (Deacon) on Dec 01, 2006 at 21:59 UTC

    Nice thread. I would treat the file as a database.

    A Unix solution would be:

  • cut the kill file into k chunks
  • do fast string matching on the serial file via grep
  • grep -nF -f chunk serial_file > delete_file # you need k steps!
  • Perl then gets the line numbers n from delete_file; for each n, substitute that line in the serial file with "X" x length $_ (sketched below)

    If you later add entries, put them at the first X'd-out line that fits, or else append.

    (update) If you need to repeat the process a few times, it may be worth sorting the serial file.
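
    A sketch of the overwrite step, keyed on the line numbers that grep -n produced (the file names are made up; it mirrors the seek-and-overwrite idea from earlier in the thread, and the newline is preserved so record boundaries don't move):

        use strict;
        use warnings;

        # Line numbers to blank, taken from grep -n output such as "12345:00020123837,,y".
        my %doomed;
        open my $del, '<', 'delete_file' or die "Can't open delete_file: $!";
        while ( <$del> ) {
            my( $n ) = split /:/;
            $doomed{ $n } = 1;
        }
        close $del;

        # Overwrite each doomed line in place with X's of the same length.
        open my $fh, '+<', 'serial_file' or die "Can't open serial_file: $!";
        my $start  = tell $fh;
        my $lineno = 0;
        while ( my $line = <$fh> ) {
            $lineno++;
            if ( exists $doomed{ $lineno } ) {
                my $next = tell $fh;
                chomp( my $copy = $line );
                seek $fh, $start, 0;
                print $fh 'X' x length $copy;   # same byte count, newline untouched
                seek $fh, $next, 0;             # required before the next read
            }
            $start = tell $fh;
        }
        close $fh or die "Close failed: $!";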

Re: 15 billion row text file and row deletes - Best Practice?
by serf (Chaplain) on Dec 02, 2006 at 15:34 UTC

    Wow! What an eye-catching question, good one!

    I would be wary of thinking of using grep, as bsdz and sgt have mentioned.

    If you have a look at

    grep -vf exclude_file to_thin_file in perl

    you will see that Perl can do this much faster and with less memory than grep if the script is written efficiently.

    My workmate swears by DBD::CSV - but I haven't used it.

    Personally I think I'd feel safer writing to a new file in case anything went wrong while it was writing back - if it's running for a week that's a long time to risk it crashing!

Re: 15 billion row text file and row deletes - Best Practice?
by bsdz (Friar) on Dec 04, 2006 at 09:36 UTC
    No doubt you've probably already decided on a solution. In response to the ideas that require a kill list loaded into memory (for example, my grep suggestion): if one needs to remove 80GB of data due to the 300GB disk space constraint, and there are 30 bytes per record and each serial number is 11 bytes, then the amount of memory needed to load all those serial numbers is: -

    (80GB / 30B) * 11B = 29.3333333GB

    (calculation courtesy of Google).

Re: 15 billion row text file and row deletes - Best Practice?
by mattr (Curate) on Dec 02, 2006 at 13:35 UTC
    Hi, BrowserUK's reply looks quite good.

    I just would like to chip in that if you are willing to chunk your data file into 100MB bites, then based on my timing of Unix sort on a 100MB file of 9-digit numbers, it would cost you about 100 hours to sort all the data first. If you can at the same time ensure fixed-length columns, it will help.

    You really do not want to do any more disk I/O than necessary, since on my machine it takes 7 hours just to read the file once. If you have a sorted kill list that is chunked according to the extents of each data chunk, you also only have 1/40 of the kill list to worry about (if the distribution is even).
