The simple Perl approach is to read the file and put all the 'keys' into a hash, which will tell you which are duplicated. But hashes are memory hungry, and it's not surprising that this won't work for your large files.
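
For scale, here is a minimal sketch of that naive approach (assuming colon-delimited keys, as in the test data further down). Every distinct key stays resident in the hash at once, which is exactly what blows up on multi-gigabyte files:

use strict ;
use warnings ;

# Naive duplicate finder -- sketch only; holds every distinct key in memory.
my %seen ;
while (my $line = <STDIN>) {
  my $key = (split(/:/, $line))[0] ;   # assumes colon-delimited keys
  push @{$seen{$key}}, $. ;            # remember each line number per key
} ;
foreach my $key (sort keys %seen) {
  print "$key: ", join(', ', @{$seen{$key}}), "\n" if @{$seen{$key}} > 1 ;
} ;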

If you expect a relatively small number of duplicate keys, then the divide and conquer approach would be to eliminate as many not-duplicates as possible -- to get down to a set of keys that can be held in a hash.

The following is what I have in mind:

  1. read the original file, identify each key and run it through a digest function (the module String::CRC32 does a good job of returning a 32-bit digest). Use a bitmap to spot repeat digest values -- at one bit per possible 32-bit digest, that's 2**32 bits, or 512MB of memory. See the sketch below for how to do that, and the minimal vec() illustration just after this list. Record digest values that appear more than once.

  2. discard the bitmap and load the recorded digest values into a hash.

  3. read the original file again, identify each key and again run it through the digest function. Any key whose digest is not in the hash of digests, you can ignore. The remainder are potential duplicates, which you put into another hash in order to identify true duplicates.

  4. now you have a hash from which you can extract the actual duplicate keys.
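
For the bitmap in Step 1, Perl's built-in vec() treats a string as a bit array. A minimal illustration of the test-and-set idiom (real use needs "\0" x 2**29 -- 2**32 bits = 512MB -- and the sketch below splits that into chunks to keep individual strings a manageable size):

use strict ;
use warnings ;

my $bitmap = '' ;                    # vec() auto-extends on assignment

sub test_and_set {
  my ($d) = @_ ;                     # $d -- digest, used as a bit index
  my $was = vec($bitmap, $d, 1) ;    # read current bit (0 if never set)
  vec($bitmap, $d, 1) = 1 ;          # set it
  return $was ;                      # 1 if this digest was seen before
} ;

print test_and_set(12345), "\n" ;    # 0 -- first sighting
print test_and_set(12345), "\n" ;    # 1 -- repeat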

Note that how well this works depends largely on how many duplicates there are.

By way of an estimate: the small sample you gave suggests a line length of more than 100 bytes, so a file of ~2G contains ~2E7 keys. If the keys were all different, and using CRC32 as the digest, I would expect ~1E5 candidate digests to be collected by Step 1 -- that is, 1/200 of the original keys -- which looks tractable. If only a small number of keys are duplicated, then this is a reasonable estimate of the memory requirement.

If 5% of the keys are repeated once each, then the number of real duplicates dominates. This would give ~1E6 keys, which I would expect to still be tractable -- even assuming 200-300 bytes per key, this is less than the 512M required for the bitmap.

However, if 20% of the keys are repeated once each, then that's ~4E6 keys in the final hash, which may still be a challenge memory-wise alongside the hash containing the ~4E6 key digests.
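
To make those figures explicit, here is a back-of-envelope check (a sketch only -- the 2G file size, 100-byte lines and repeat rates are the assumptions used above):

use strict ;
use warnings ;

my $file_bytes = 2e9 ;                        # ~2G file
my $line_bytes = 100 ;                        # assumed line length
my $n          = $file_bytes / $line_bytes ;  # ~2E7 keys

# Birthday approximation: expected colliding pairs among $n random
# 32-bit digests is n(n-1)/(2m), with m = 2**32 possible values.
my $m     = 2**32 ;
my $pairs = $n * ($n - 1) / (2 * $m) ;

printf "keys:                  ~%.0e\n", $n ;              # ~2e+07
printf "chance-collision keys: ~%.0e\n", 2 * $pairs ;      # ~9e+04, order 1E5
printf "5%% repeated once:      ~%.0e keys\n", 0.05 * $n ;  # ~1e+06
printf "20%% repeated once:     ~%.0e keys\n", 0.20 * $n ;  # ~4e+06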

use strict ;
use warnings ;

# Dummies for quick test.
open my $FH, '<', &dummy_input or die "horribly $!" ;

sub extract_key {
  my ($line) = @_ ;
  return (split(/:/, $line))[0] ;
} ;

# In order to get a false positive out of the digest, this mask is used.
# (Should be removed for real use !)
use constant SMALL_DIGEST => 0xA11A5885 ;

#------------------------------------------------------------------------
# Step 1. Identify digests for possible duplicate keys.
#
#   NB: may identify the same digest value a number of times.

my $cand_file = '' ;
open my $CAND, '>', \$cand_file ;   # Can write to a real file if required !

alloc_bm() ;                        # swallows 512M memory !

my $line_count = 0 ;
my $cand_count = 0 ;
while (my $line = <$FH>) {
  $line_count++ ;
  my $d = digest(extract_key($line)) ;
  if (setbit($d)) {
    print $CAND pack('N', $d) ;
    $cand_count++ ;
  } ;
} ;

dealloc_bm() ;
close $CAND ;

print STDERR "Read $line_count lines and identified $cand_count candidate digests\n" ;

#------------------------------------------------------------------------
# Step 2. Construct the candidate hash.
#
#   NB: this is assuming that there will be a relatively small number of
#       candidates -- in particular assuming that a hash will be much
#       smaller than another bitmap !

my %candidates = () ;
my $cand_check = 0 ;
open $CAND, '<', \$cand_file ;
while (1) {
  my $ds ;
  my $rc = read $CAND, $ds, 4 ;
  if (!defined($rc)) { die "failed reading candidates $!" ; } ;
  last if $rc == 0 ;
  if ($rc != 4)      { die "unexpected EOF reading candidates" ; } ;
  $cand_check++ ;
  $candidates{$ds} = undef ;
} ;
print STDERR "Read $cand_check candidate digests\n" ;
if ($cand_check != $cand_count) { die "expected $cand_count candidates" ; } ;
close $CAND ;
$cand_file = undef ;                # Release memory !

#------------------------------------------------------------------------
# Step 3. Identify the duplicate keys and related records.
#
#   NB: this is assuming that this is now a small enough problem !

my %key_hash = () ;
my $line_number = 0 ;
seek $FH, 0, 0 or die "failed to seek to start of input $!" ;
while (my $line = <$FH>) {
  my $key = extract_key($line) ;
  my $ds  = pack('N', digest($key)) ;
  if (exists($candidates{$ds})) {
    push @{$key_hash{$key}}, $line_number ;
  } ;
  $line_number++ ;
} ;
if ($line_number != $line_count) {
  die "Pass 1 read $line_count lines. Pass 2 read $line_number !" ;
} ;
close $FH ;

#------------------------------------------------------------------------
# Step 4. Discard the not-duplicates after all.
#
#   NB: we are here discarding keys which had the same digest as some
#       other key, but are not equal to any other key.

my $discard = 0 ;
foreach my $key (keys %key_hash) {
  if (@{$key_hash{$key}} == 1) {
    $discard++ ;
    delete $key_hash{$key} ;
  } ;
} ;
print STDERR "Discarded $discard 'not duplicate after all' keys\n" ;

#------------------------------------------------------------------------
# Step 5. Process the duplicates.
#
#   %key_hash now contains the actual duplicate keys, and for each one a
#   list of line numbers for the key.

print "Identified:\n" ;
foreach my $k (sort keys(%key_hash)) {
  print "$k: ", join(', ', @{$key_hash{$k}}), "\n" ;
} ;

##########################################################################
#
# Bit Map Handling -- 32-bit address space
#
# We divide the bitmap up into chunks in an array -- to keep individual
# vec strings to some 'reasonable' size.
#
#   BMX      = shift to extract chunk index from a digest
#   BMM      = mask to extract bit index within a chunk
#
#   BM_COUNT = number of chunks
#   BM_SIZE  = size of each chunk, in bytes

use constant BITS     => 32 ;
use constant BMS      => 16 ;                 # 2**16 = 64K -- chunk size in BYTES
use constant BMX      => BMS + 3 ;            # Shift to get chunk index
use constant BMM      => (1 << BMX) - 1 ;     # Mask to get bit index
use constant BM_COUNT => 1 << (BITS - BMX) ;  # Number of chunks in bitmap
use constant BM_SIZE  => 1 << BMS ;           # Size of chunk in BYTES

my @bitmap = () ;                             # The bitmap

use String::CRC32 qw(crc32) ;                 # our hash/digest function

#==========================================================================
# alloc_bm: allocate empty bitmap
#
#   Requires: Nothing
#
#   Returns:  Nothing

sub alloc_bm {
  $#bitmap = BM_COUNT - 1 ;                   # pre-extend the array
  for my $i (0..BM_COUNT - 1) {               # fill array with zeroed vec strings
    $bitmap[$i] = "\0" x BM_SIZE ;
  } ;
} ;

#==========================================================================
# dealloc_bm: deallocate bitmap
#
#   Requires: Nothing
#
#   Returns:  Nothing

sub dealloc_bm {
  undef @bitmap ;
} ;

#==========================================================================
# digest: calculate digest
#
#   Requires: $key -- key string
#
#   Returns:  $d   -- digest value: 32 bits of unsigned integer

sub digest {
  my ($key) = @_ ;
  return crc32($key) & SMALL_DIGEST ;
} ;

#==========================================================================
# setbit: set bit in bitmap, and return whether it was set before.
#
#   Requires: $d -- bitmap index
#
#   Returns:  0 or 1 -- previous bitmap value

sub setbit {
  my ($d) = @_ ;
  my $vec = $d >> BMX ;
  my $bit = $d & BMM ;
  my $was = vec($bitmap[$vec], $bit, 1) ;
  if (!$was) {
    vec($bitmap[$vec], $bit, 1) = 1 ;
  } ;
  return $was ;
} ;

##########################################################################
# Dummy input

sub dummy_input { \(<<'~~FILE') }
abcdefg: line 0
gdshdff: line 1
djdkled: line 2
hhqmshq: line 3 -- false duplicate (with SMALL_DIGEST)
zyxwvut: line 4
abcdefg: line 5 -- duplicate
bbewyeb: line 6
piuewpi: line 7
zyxwvut: line 8 -- duplicate
hfdjhjh: line 9
abcdefg: line 10 -- duplicate
lsjquor: line 11 -- false duplicate (with SMALL_DIGEST)
shargfd: line 12
~~FILE
##########################################################################
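
To run the sketch against a real file rather than the built-in test data, the in-memory test handle and the deliberate digest-weakening mask would go -- something like this (the filename is hypothetical, and extract_key would be adjusted to match the real record layout):

# Adaptation for real input -- filename is hypothetical.
open my $FH, '<', 'huge_file.txt'
    or die "cannot open huge_file.txt: $!" ;

sub digest {
  my ($key) = @_ ;
  return crc32($key) ;    # full 32-bit CRC -- SMALL_DIGEST mask removed
} ;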

In reply to Re: Huge files manipulation by gone2015
in thread Huge files manipulation by klashxx
