http://qs321.pair.com?node_id=722634

klashxx has asked for the wisdom of the Perl Monks concerning the following question:


Hi, I need a fast way to delete duplicate entries from very huge files (>2 GB); these files are in plain text.
To clarify, this is the structure of the file:
30xx|000009925000194653|00000000000000|20081031|02510|00000005445363|01|F|0207|00|||+0005655,00|||+0000000000000,00
30xx|000009925000194653|00000000000000|20081031|02510|00000005445363|01|F|0207|00|||+0000000000000,00|||+0000000000000,00
30xx|4150010003502043|CARDS|20081031|MP415001|00000024265698|01|F|1804|00|||+0000000000000,00|||+0000000000000,00

Having a key formed by the first 7 fields, I want to print or delete only the duplicates (the delimiter is the pipe).

I tried all the usual methods (awk / sort / uniq / sed / grep ...) but it always ended with the same result: out of memory!

I'm using large HP-UX servers.

I'm very new to Perl, but I read somewhere that the Tie::File module can handle very large files. I tried it but cannot get the right code...

Any advice would be very welcome.

Thank you in advance.

Regards

PS: I do not want to split the files.

Re: Huge files manipulation
by BrowserUk (Patriarch) on Nov 10, 2008 at 14:33 UTC

    One reason for not using sort -u or uniq commands is if you wish to retain the original ordering (minus the discards). If that's the case, this might work for you.

    The problem with uniqing huge files with Perl is the memory footprint of the hash required to remember all the records. And you cannot partition the dataset by record number (first N; next N; etc.) unless the records are sorted, because you need to process the whole dataset. What's needed is an alternative way of partitioning the dataset that allows the uniqing to work without losing the original ordering.

    One way of doing this is to make multiple passes, and only consider some subset of the records during each pass. A simple partitioning mechanism is to use the first character (or n characters) of each record. For example, if all your records start with a digit, 0-9, then you can make 10 passes and only consider those that start with each digit during each pass. This reduces the memory requirement for the hash to 1/10th.

    If your records start with alpha characters, then you get a natural split into 26 passes.

    If the single character partition is still too large, use the first two digits/characters for a split into 100/676 passes.

    If the number of passes is more than needed, then you can choose 'AB' for the first pass, 'CD' for the second, and so on.

    You record the file offsets of each line that needs to be discarded (on the basis that you are likely to be discarding fewer records than you are going to retain), then sort these offsets numerically and make a final sequential pass, checking each line's start offset against the first offset in the discards array and only outputting the line if it does not match. Once you have found a discard, you shift that offset off the discards array and continue.

    The following code assumes the records can start with any alphanumeric character, and so will make a total of 62 passes. Tailor it to suit your data:

    #! perl -sw
    use strict;

    open FILE, '<', $ARGV[ 0 ] or die $!;

    my @discards;   ## offsets of lines to discard
    my %hash;       ## remembers the keys seen during each pass

    ## IMPORTANT: Tailor the following to your data!
    ## If all your records start with a digit (0-9),
    ## the loop as shown would make 52 redundant passes!
    ## During each pass, only consider lines that start with this char
    for my $firstChar ( 0 .. 9, 'A' .. 'Z', 'a' .. 'z' ) {
        warn "Pass $firstChar\n";

        my $offset = 0;     ## offset starts at zero for each pass
        while( <FILE> ) {
            if( m[^$firstChar] ) {  ## if the line starts with this pass's char
                ## Form the key from the first 7 fields
                my $key = join '|', ( split '\|', $_, 8 )[ 0 .. 6 ];

                ## If the key doesn't exist yet, remember it
                if( not exists $hash{ $key } ) {
                    $hash{ $key } = undef;
                }
                ## If it does, add this line's offset to the discards array
                else {
                    push @discards, $offset;    ## start offset of the line just read
                }
            }
            $offset = tell FILE;    ## start offset of the next line
        }

        undef %hash;        ## Empty the hash
        seek FILE, 0, 0;    ## And reset to the start after each pass
    }

    printf STDERR "Sorting %d discards\n", scalar @discards;

    ## Sort the list of discard offsets ascending
    @discards = sort { $a <=> $b } @discards;

    ## Final pass: output only those lines whose start offset is not a discard
    my $offset = 0;
    while( my $line = <FILE> ) {
        if( @discards and $offset == $discards[ 0 ] ) {
            shift @discards;    ## this line is a duplicate; skip it
        }
        else {
            print $line;        ## output the good line (for redirection)
        }
        $offset = tell FILE;    ## start offset of the next line
    }

    close FILE;

    Usage: HugeUniq.pl hugeFile > uniqHugeFile

    Takes about 40 minutes to process a 6 million record/2GB file on my system using 26 passes.


    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
      One reason for not using sort -u or uniq commands is if you wish to retain the original ordering (minus the discards).

      As said earlier in the thread, if you want to keep ordering, just add the line number, sort, uniquify, sort on the line number and cut. Or, as a one liner:

      nl -s '|' file_with_dups | sort -k 2,8 -t '|' -u | sort -nb | cut -d '|' -f 2- > file_without_dups
      This is a standard trick.
        This is a standard trick.

        Yes. And it requires 4 passes of the dataset, including two full sorts.

        And takes 5 or 6 times as long as the script above, if I use 5 passes ('a-e','f-j', 'k-o', 'p-t', 'u-z') on the same dataset.


        Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
        "Science is about questioning the status quo. Questioning authority".
        In the absence of evidence, opinion is indistinguishable from prejudice.
Re: Huge files manipulation
by johngg (Canon) on Nov 10, 2008 at 14:09 UTC

    If you want the output file to contain only the first instance of each key, in the order found in the input file, you could try processing the input file line by line. The script keeps track of the keys encountered in the %seen hash and only prints a record to the output file if it hasn't been seen before. If there are so many unique keys that this hash starts causing resource problems, you could tie it to a disk-based DBM such as Berkeley DB or GDBM (see the sketch after the sample output below).

    Given the input in your OP, this code

    use strict;
    use warnings;

    my $inFile = q{spw722634.in};
    open my $inFH, q{<}, $inFile or die qq{open: < $inFile: $!\n};

    my $outFile = q{spw722634.out};
    open my $outFH, q{>}, $outFile or die qq{open: > $outFile: $!\n};

    my %seen = ();

    while( <$inFH> ) {
        my $key = join q{}, ( split m{\|}, $_, 8 )[ 0 .. 6 ];
        print $outFH $_ unless $seen{ $key } ++;
    }

    close $inFH  or die qq{close: < $inFile: $!\n};
    close $outFH or die qq{close: > $outFile: $!\n};

    produces an output file with these records

    30xx|000009925000194653|00000000000000|20081031|02510|00000005445363|01|F|0207|00|||+0005655,00|||+0000000000000,00
    30xx|4150010003502043|CARDS|20081031|MP415001|00000024265698|01|F|1804|00|||+0000000000000,00|||+0000000000000,00
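
    If the %seen hash itself outgrows memory, the tie idea above might look something like the following minimal sketch. It assumes the DB_File module shipped with Perl (Fcntl supplies the O_* flags) and a hypothetical scratch file name; another DBM module could be substituted in the tie call.

    use strict;
    use warnings;
    use Fcntl;      ## for the O_RDWR and O_CREAT flags
    use DB_File;    ## Berkeley DB tie interface

    ## Hypothetical scratch file holding the tied hash on disk
    my $dbmFile = q{spw722634.seen};

    my %seen;
    tie %seen, q{DB_File}, $dbmFile, O_RDWR | O_CREAT, 0644
        or die qq{tie: $dbmFile: $!\n};

    while( <> ) {
        my $key = join q{}, ( split m{\|}, $_, 8 )[ 0 .. 6 ];
        print unless $seen{ $key } ++;
    }

    untie %seen;
    unlink $dbmFile;    ## the DBM file is only scratch space

    This version reads the input file named on the command line and writes the survivors to STDOUT, so run it with a redirection; expect it to be noticeably slower than the in-memory version, but with roughly constant memory use.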

    I hope this is the sort of solution you are aiming for and that you find this of use.

    Cheers,

    JohnGG

Re: Huge files manipulation
by JavaFan (Canon) on Nov 10, 2008 at 12:18 UTC
    I find it very surprising that sort ends with an out of memory error, as sort dates from a time when memory was tiny compared to today, and back then it was able to sort huge files: sort will use temporary files to avoid having to use more memory than the OS is able to give it.

    Your problem sounds like a task 'sort -u' was made for.

      Unix and GNU sort rely on having an adequate amount of available disk space on /tmp; if the OP is using a machine with insufficient free space on /tmp, sort -u will fail.
        Only if you haven't told sort it can use another directory. (GNU) sort will use either the directory given with the -T option, the directory in the TMPDIR environment variable, or /tmp, in that order. The POSIX standard mentions that -T is undocumented but exists in some implementations, and it encourages implementors to support the TMPDIR environment variable.

      I'm not surprised at all. HPUX is not GNU/Linux. The GNU tools are much more flexible and built to overcome the limitations found in many standard UNIX tools.

      As an example, I remember using ls in a directory with many files (>64k, oops) and HP's ls just barfed. Of course GNU's ls has no such problem. If it's optional in the standard, GNU has it but HPUX probably doesn't.

      Disclaimer: I haven't run any "big iron" in over 5 years so things may have changed.

Re: Huge files manipulation
by educated_foo (Vicar) on Nov 10, 2008 at 12:09 UTC
    You need to take some sort of divide-and-conquer approach. For example, you could split the data into a bunch of temporary files according to the first 2-3 fields, uniquify each of those files, then cat them back together into one big file.
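
    A minimal sketch of that idea, under the assumption that hashing the whole 7-field key into a fixed number of temporary bucket files (here 16, named bucket.N.tmp) is acceptable instead of splitting literally on the first 2-3 fields; duplicates still always land in the same bucket, but note that the output order is not the original line order.

    use strict;
    use warnings;
    use Digest::MD5 qw(md5);    ## any stable hash of the key will do

    my $buckets = 16;           ## assumption: 16 buckets keep each %seen small

    ## Phase 1: scatter the lines into temporary files by a hash of the key
    my @tmpFH;
    for my $n ( 0 .. $buckets - 1 ) {
        open my $fh, '+>', "bucket.$n.tmp" or die "bucket.$n.tmp: $!\n";
        push @tmpFH, $fh;
    }
    while( <> ) {
        my $key = join '|', ( split /\|/, $_, 8 )[ 0 .. 6 ];
        my $n   = unpack( 'N', md5( $key ) ) % $buckets;
        print { $tmpFH[ $n ] } $_;
    }

    ## Phase 2: uniquify each bucket with a small per-bucket hash,
    ## then "cat" the survivors back together on STDOUT
    for my $n ( 0 .. $buckets - 1 ) {
        my %seen;
        my $fh = $tmpFH[ $n ];
        seek $fh, 0, 0;
        while( my $line = <$fh> ) {
            my $key = join '|', ( split /\|/, $line, 8 )[ 0 .. 6 ];
            print $line unless $seen{ $key } ++;
        }
        close $fh;
        unlink "bucket.$n.tmp";
    }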
Re: Huge files manipulation
by mirod (Canon) on Nov 10, 2008 at 12:33 UTC

    I am curious to know why sort (the unix tool) is not working. If the file is unsorted (on the composite key) and you care about the order of the lines, why not add the line number, sort -u on the key, sort on the line number, and then cut the line number?

    If the problem is getting the key using the admittedly obscure sort options, you can have a perl (or awk, or whatever) script gather the complete key and add it as an extra field on each line, so that sort can find it easily. Incidentally, this is very similar to a Schwartzian transform.
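
    As a minimal sketch of that decorate step (here the key is trivially the first 7 fields, but the same pattern works for any derived key), a small Perl filter can prepend the line number and the assembled key to every record:

    use strict;
    use warnings;

    ## Prepend "lineno|key|" to each record so that sort only has to look
    ## at simple leading fields; $. is the current input line number
    while( <> ) {
        my $key = join '|', ( split /\|/, $_, 8 )[ 0 .. 6 ];
        print "$.|$key|$_";
    }

    The decorated stream can then be sorted uniquely on the key fields, re-sorted numerically on the first field, and stripped of the first eight added fields to recover the surviving records in their original order.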

Re: Huge files manipulation
by gone2015 (Deacon) on Nov 10, 2008 at 20:35 UTC

    The simple Perl approach is to read the file and put all the 'keys' into a hash, which will tell you which are duplicated. But hashes are memory hungry, and it's not surprising that this won't work for your large files.

    If you expect a relatively small number of duplicate keys, then the divide-and-conquer approach would be to eliminate as many not-duplicates as possible, to get down to a set of keys that can be held in a hash.

    The following is what I have in mind:

    1. read the original file, identify each key and run it through a digest function (the module String::CRC32 does a good job of returning a 32-bit digest). Use a bitmap to spot repeat digest values; this requires 512MB of memory. See the sketch of how to do that, below. Record digest values that appear more than once.

    2. discard the bitmap and load the recorded digest values into a hash.

    3. read the original file again, identify each key and again run it through the digest function. Any key whose digest is not in the hash of digests, you can ignore. The remainder are potential duplicates, which you put into another hash in order to identify true duplicates.

    4. now you have a hash from which you can extract the actual duplicate keys.

    noting that how well this will work depends largely on how many duplicates there are.

    By way of an estimate: the small sample you gave suggests a line length of more than 100 bytes, so a file of ~2GB contains ~2E7 keys. If the keys were all different, and using CRC32 as the digest, I would expect ~1E5 candidate digests to be collected by Step 1 (with 2E7 random 32-bit digests, the expected number of colliding values is roughly (2E7)^2 / (2 * 2^32), i.e. about 5E4, flagging roughly 1E5 keys) -- that is 1/200 of the original keys -- which looks tractable. If only a small number of keys are duplicated, then this is a reasonable estimate of the memory requirement.

    If 5% of the keys are repeated once each, then the number of real duplicates dominates. This would give ~1E6 keys, which I would expect to still be tractable -- even assuming 200-300 bytes per key, this is less than the 512M required for the bitmap.

    However, if 20% of the keys are repeated once each, then that's ~4E6 keys in the final hash, which may still be a challenge memory-wise alongside the hash containing the ~4E6 key digests.
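
    A minimal sketch of steps 1 to 4 above, assuming String::CRC32's crc32() as the digest function and a plain Perl string used as the 2**32-bit (512MB) bitmap via vec():

    use strict;
    use warnings;
    use String::CRC32;    ## exports crc32()

    my $file = shift or die "usage: $0 file\n";
    open my $fh, '<', $file or die "$file: $!\n";

    ## Step 1: one bit per possible CRC32 value -- 2**32 bits == 512MB
    my $bitmap = "\0" x ( 2**32 / 8 );
    my %dupDigest;        ## digest values seen more than once

    while( <$fh> ) {
        my $crc = crc32( join '|', ( split /\|/, $_, 8 )[ 0 .. 6 ] );
        if( vec( $bitmap, $crc, 1 ) ) {
            $dupDigest{ $crc } = undef;    ## candidate duplicate
        }
        else {
            vec( $bitmap, $crc, 1 ) = 1;
        }
    }

    ## Step 2: discard the bitmap, keep only the repeated digest values
    undef $bitmap;

    ## Step 3: second pass -- count only keys whose digest repeated
    my %count;
    seek $fh, 0, 0;
    while( <$fh> ) {
        my $key = join '|', ( split /\|/, $_, 8 )[ 0 .. 6 ];
        $count{ $key }++ if exists $dupDigest{ crc32( $key ) };
    }
    close $fh;

    ## Step 4: the true duplicate keys
    print "$_\n" for grep { $count{ $_ } > 1 } keys %count;

    The 512MB bitmap dominates the memory use of the first pass; the %dupDigest and %count hashes only ever hold the collision candidates, which is the whole point of the scheme.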

Re: Huge files manipulation
by blazar (Canon) on Nov 11, 2008 at 12:20 UTC

    I personally believe it would have been nice of you to mention that you asked the very same question elsewhere. (Also available for webby guys!) Did they help you there? How did they fail to do so?

    --
    If you can't understand the incipit, then please check the IPB Campaign.