Re: Huge files manipulation

by gone2015 (Deacon)
on Nov 10, 2008 at 20:35 UTC ( [id://722718] )


in reply to Huge files manipulation

The simple Perl approach is to read the file and put all the 'keys' into a hash, which will tell you which are duplicated. But hashes are memory hungry, and it's not surprising that this won't work for your large files.
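
For scale, that straightforward version is only a few lines -- something like the sketch below (the filename is a placeholder, and I'm assuming the key is the first colon-delimited field, as in your sample) -- and it is precisely the %seen hash here that blows up on a ~2G file:

use strict ;
use warnings ;

my %seen ;
open my $FH, '<', 'huge_file.txt' or die "open failed $!" ;   # placeholder filename
while (my $line = <$FH>) {
  my $key = (split(/:/, $line))[0] ;      # key = first ':'-delimited field
  push @{$seen{$key}}, $. ;               # remember line numbers per key
} ;
close $FH ;

foreach my $key (sort keys %seen) {
  next if @{$seen{$key}} == 1 ;           # not duplicated
  print "$key: ", join(', ', @{$seen{$key}}), "\n" ;
} ;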

If you expect a relatively small number of duplicate keys, then the divide and conquer approach would be to eliminate as many not-duplicates as possible -- to get down to a set of keys that can be held in a hash.

The following is what I have in mind:

  1. read the original file, identify each key and run it through a digest function (the module String::CRC32 does a good job of returning a 32-bit digest). Use a bitmap to spot repeat digest values -- this requires 512MB of memory (2**32 bits); see the sketch below for how to do that. Record digest values that appear more than once.

  2. discard the bitmap and load the recorded digest values into a hash.

  3. read the original file again, identify each key and again run it through the digest function. Any key whose digest is not in the hash of digests, you can ignore. The remainder are potential duplicates, which you put into another hash in order to identify true duplicates.

  4. now you have a hash from which you can extract the actual duplicate keys.

Note that how well this works depends largely on how many duplicates there are.

By way of an estimate: the small sample you gave suggests a line length of more than 100 bytes, so a file of ~2G contains ~2E7 keys. If the keys were all different, and using CRC32 as the digest, I would expect ~1E5 candidate digests to be collected by Step 1 -- that is 1/200 of the original keys -- which looks tractable. If only a small number of keys are duplicated, then this is a reasonable estimate of the memory requirement.
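
To back that up with a quick back-of-envelope calculation (assuming CRC32 spreads these keys roughly like a uniform 32-bit hash), the birthday approximation gives:

# n distinct keys hashed into 2**32 digest values, with n much smaller than 2**32:
#   digests hit more than once ~ n**2 / 2**33
#   keys sharing such a digest ~ n**2 / 2**32   (both keys of each colliding pair)
my $n = 2e7 ;
printf "repeated digests ~ %.1e\n", $n**2 / 2**33 ;   # ~4.7e+04
printf "candidate keys   ~ %.1e\n", $n**2 / 2**32 ;   # ~9.3e+04, i.e. ~1/200 of the keys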

If 5% of the keys are repeated once each, then real duplicates dominate the candidate set. That gives ~1E6 keys, which I would expect to still be tractable -- even assuming 200-300 bytes per key, that is less than the 512M required for the bitmap.

However, if 20% of the keys are repeated once each, that's ~4E6 keys in the final hash, which may still be a challenge memory-wise alongside the hash containing the ~4E6 key digests.
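
As a rough sanity check on those two cases (taking ~250 bytes as a middle-of-the-road guess for key plus line-number list plus hash overhead per entry -- the real figure depends on the perl build):

for my $frac (0.05, 0.20) {
  my $keys = $frac * 2e7 ;                # duplicated keys ending up in the final hash
  printf "%2.0f%% duplicated: %.0e keys, roughly %4.0f MB\n",
         $frac * 100, $keys, $keys * 250 / 2**20 ;
} ;
#  5% duplicated: 1e+06 keys, roughly  238 MB
# 20% duplicated: 4e+06 keys, roughly  954 MB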

use strict ;
use warnings ;

# Dummies for quick test.
open my $FH, '<', &dummy_input or die "horribly $!" ;

sub extract_key {
  my ($line) = @_ ;
  return (split(/:/, $line))[0] ;
} ;

# In order to get a false positive out of the digest, this mask is used.
# (Should be removed for real use !)
use constant SMALL_DIGEST => 0xA11A5885 ;

#------------------------------------------------------------------------------------------
# Step 1.  Identify digests for possible duplicate keys.
#
#          NB: may identify the same digest value a number of times.

my $cand_file = '' ;
open my $CAND, '>', \$cand_file ;      # Can write to a real file if required !

alloc_bm() ;                           # swallows 512M memory !

my $line_count = 0 ;
my $cand_count = 0 ;
while (my $line = <$FH>) {
  $line_count++ ;
  my $d = digest(extract_key($line)) ;
  if (setbit($d)) {
    print $CAND pack('N', $d) ;
    $cand_count++ ;
  } ;
} ;

dealloc_bm() ;
close $CAND ;

print STDERR "Read $line_count lines and identified $cand_count candidate digests\n" ;

#------------------------------------------------------------------------------------------
# Step 2.  Construct the candidate hash.
#
#          NB: this is assuming that there will be a relatively small number of
#              candidates -- in particular assuming that a hash will be much smaller
#              than another bitmap !

my %candidates = () ;
my $cand_check = 0 ;
open $CAND, '<', \$cand_file ;
while (1) {
  my $ds ;
  my $rc = read $CAND, $ds, 4 ;
  if (!defined($rc)) { die "failed reading candidates $!" ; } ;
  last if $rc == 0 ;
  if ($rc != 4)      { die "unexpected EOF reading candidates" ; } ;
  $cand_check++ ;
  $candidates{$ds} = undef ;
} ;
print STDERR "Read $cand_check candidate digests\n" ;
if ($cand_check != $cand_count) { die "expected $cand_count candidates" ; } ;

close $CAND ;
$cand_file = undef ;                   # Release memory !

#------------------------------------------------------------------------------------------
# Step 3.  Identify the duplicate keys and related records.
#
#          NB: this is assuming that this is now a small enough problem !

my %key_hash = () ;
my $line_number = 0 ;
seek $FH, 0, 0 or die "failed to seek to start of input $!" ;
while (my $line = <$FH>) {
  my $key = extract_key($line) ;
  my $ds  = pack('N', digest($key)) ;
  if (exists($candidates{$ds})) {
    push @{$key_hash{$key}}, $line_number ;
  } ;
  $line_number++ ;
} ;
if ($line_number != $line_count) {
  die "Pass 1 read $line_count lines. Pass 2 read $line_number !" ;
} ;

close $FH ;

#------------------------------------------------------------------------------------------
# Step 4.  Discard the not-duplicates after all.
#
#          NB: we are here discarding keys which had the same digest as some other key,
#              but are not equal to any other key.

my $discard = 0 ;
foreach my $key (keys %key_hash) {
  if (@{$key_hash{$key}} == 1) {
    $discard++ ;
    delete $key_hash{$key} ;
  } ;
} ;

print STDERR "Discarded $discard 'not duplicate after all' keys\n" ;

#------------------------------------------------------------------------------------------
# Step 5.  Process the duplicates.
#
#          %key_hash now contains the actual duplicate keys, and for each one a list of
#          line numbers for the key.

print "Identified:\n" ;
foreach my $k (sort keys(%key_hash)) {
  print STDERR "$k: ", join(', ', @{$key_hash{$k}}), "\n" ;
} ;

############################################################################################
#
# Bit Map Handling -- 32-bit address space
#
# We divide the bitmap up into chunks in an array -- to keep individual vec strings to
# some 'reasonable' size.
#
#   BMX      = shift to extract the chunk index (the low BMX bits index the bit in a chunk)
#   BMM      = mask to extract the bit index within a chunk
#
#   BM_COUNT = number of chunks
#   BM_SIZE  = size of each chunk, in bytes

use constant BITS     => 32 ;
use constant BMS      => 16 ;                 # 2**16 = 64K -- size of chunk in BYTES
use constant BMX      => BMS + 3 ;            # Shift to get chunk index
use constant BMM      => (1 << BMX) - 1 ;     # Mask to get bit index
use constant BM_COUNT => 1 << (BITS - BMX) ;  # Number of chunks in bitmap
use constant BM_SIZE  => 1 << BMS ;           # Size of chunk in BYTES

my @bitmap = () ;                             # The bitmap

use String::CRC32 qw(crc32) ;                 # our hash/digest function

#===========================================================================================
# alloc_bm: allocate empty bitmap
#
# Requires: Nothing
#
# Returns:  Nothing

sub alloc_bm {
  $#bitmap = BM_COUNT - 1 ;                   # pre-extend the array
  for my $i (0..BM_COUNT - 1) {               # fill array with zeroed vec strings
    $bitmap[$i] = "\0" x BM_SIZE ;
  } ;
} ;

#===========================================================================================
# dealloc_bm: deallocate bitmap
#
# Requires: Nothing
#
# Returns:  Nothing

sub dealloc_bm {
  undef @bitmap ;
} ;

#===========================================================================================
# digest: calculate digest
#
# Requires: $key -- key string
#
# Returns:  $d   -- digest value: 32 bits of unsigned integer

sub digest {
  my ($key) = @_ ;
  return crc32($key) & SMALL_DIGEST ;
} ;

#===========================================================================================
# setbit: set bit in bitmap, and return whether it was set before.
#
# Requires: $d -- bitmap index
#
# Returns:  0 or 1 -- previous bitmap value

sub setbit {
  my ($d) = @_ ;
  my $vec = $d >> BMX ;
  my $bit = $d & BMM ;
  my $was = vec($bitmap[$vec], $bit, 1) ;
  if (!$was) {
    vec($bitmap[$vec], $bit, 1) = 1 ;
  } ;
  return $was ;
} ;

############################################################################################
# Dummy input

sub dummy_input { \(<<'~~FILE') }
abcdefg: line 0
gdshdff: line 1
djdkled: line 2
hhqmshq: line 3 -- false duplicate (with SMALL_DIGEST)
zyxwvut: line 4
abcdefg: line 5 -- duplicate
bbewyeb: line 6
piuewpi: line 7
zyxwvut: line 8 -- duplicate
hfdjhjh: line 9
abcdefg: line 10 -- duplicate
lsjquor: line 11 -- false duplicate (with SMALL_DIGEST)
shargfd: line 12
~~FILE
############################################################################################
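
For real use, the obvious adjustments (untested -- the filename is just a placeholder) are to open the actual input instead of the dummy data, and to drop the SMALL_DIGEST mask so the full 32-bit digest is used:

open my $FH, '<', 'huge_file.txt' or die "horribly $!" ;   # instead of &dummy_input

sub digest {
  my ($key) = @_ ;
  return crc32($key) ;           # full 32-bit digest -- no SMALL_DIGEST mask
} ;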
