Find duplicate lines from the file and write it into new file.

by anna_here (Initiate)
on Jan 04, 2007 at 13:13 UTC ( [id://592934]=perlquestion )

anna_here has asked for the wisdom of the Perl Monks concerning the following question:

I have a huge file (about 4 GB). I want to find the duplicate lines in that file and write them into a new file. I wrote some code, but I get an 'Out of memory' error.

Replies are listed 'Best First'.
Re: Find duplicate lines from the file and write it into new file.
by Corion (Patriarch) on Jan 04, 2007 at 13:26 UTC

    The traditional approach is to sort your file first; then all duplicates will appear next to each other. To sort a huge amount of data, the best approach is to divide it into small parts, sort those first, and then do a merge sort over the sorted parts. During that merge you can also output the duplicates easily.
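
    A minimal sketch of that chunk-and-merge idea (the file names and chunk size here are made up for illustration; duplicates are written out during the merge, each distinct duplicate once):

    #!/usr/bin/perl
    # Sketch only: split the input into sorted chunks on disk, then merge
    # the chunks and emit every line that appears more than once.
    use strict;
    use warnings;
    use File::Temp qw(tempfile);

    my $infile     = 'huge.txt';   # assumed input name
    my $dupfile    = 'dups.txt';   # assumed output name
    my $chunk_size = 1_000_000;    # lines per chunk; tune to available memory

    # Phase 1: write sorted chunks to temporary files.
    my @chunks;
    open my $in, '<', $infile or die "Can't read $infile: $!";
    while (!eof($in)) {
        my @lines;
        while (@lines < $chunk_size and defined(my $l = <$in>)) {
            push @lines, $l;
        }
        my ($fh) = tempfile(UNLINK => 1);
        print {$fh} sort @lines;
        seek $fh, 0, 0;
        push @chunks, $fh;
    }
    close $in;

    # Phase 2: k-way merge; identical lines now arrive consecutively.
    open my $out, '>', $dupfile or die "Can't write $dupfile: $!";
    my @head = map { scalar readline($_) } @chunks;   # current line of each chunk
    my ($prev, $reported) = (undef, 0);
    while (grep { defined } @head) {
        my $min;                                      # index of the smallest current line
        for my $i (0 .. $#head) {
            next unless defined $head[$i];
            $min = $i if !defined $min or $head[$i] lt $head[$min];
        }
        my $line = $head[$min];
        if (defined $prev and $line eq $prev) {
            print {$out} $line unless $reported;      # report each duplicate once
            $reported = 1;
        }
        else {
            ($prev, $reported) = ($line, 0);
        }
        $head[$min] = readline($chunks[$min]);        # advance that chunk
    }
    close $out;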

      If sorting is an option, and we are on Unix/Linux here, I would try out sort -u file

      update: yep, i misread the question. forget it =)

        but then you would lose the duplicates; I think the OP wanted to find the duplicates. Perhaps something along the lines of sort | uniq -c, then they could read the file and look at the counts



        This is not a Signature...
Re: Find duplicate lines from the file and write it into new file.
by pajout (Curate) on Jan 04, 2007 at 13:20 UTC
    Try to use BerkeleyDB, which allows you to transparently manipulate a hash stored in a file.
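
    For example, a minimal sketch with the core DB_File module (a tied on-disk hash; BerkeleyDB.pm works along the same lines, and the file names here are made up):

    #!/usr/bin/perl
    # Sketch: keep the seen-line counts in an on-disk hash instead of RAM.
    use strict;
    use warnings;
    use DB_File;
    use Fcntl;

    my $infile  = 'huge.txt';   # assumed input name
    my $dupfile = 'dups.txt';   # assumed output name

    tie my %seen, 'DB_File', 'seen.db', O_RDWR | O_CREAT, 0666, $DB_BTREE
        or die "Cannot tie seen.db: $!";

    open my $in,  '<', $infile  or die "Can't read $infile: $!";
    open my $out, '>', $dupfile or die "Can't write $dupfile: $!";

    while (my $line = <$in>) {
        chomp $line;
        # print each duplicated line the first time it repeats
        print {$out} "$line\n" if ++$seen{$line} == 2;
    }

    close $out;
    untie %seen;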
Re: Find duplicate lines from the file and write it into new file.
by davidrw (Prior) on Jan 04, 2007 at 13:54 UTC
    Is the file sorted? If so, a simple uniq -d foo.txt > dups.txt will do it.
    I wrote one coding but I have 'Out of memory Error'
    What's the code look like? Someone here may be able to provide suggestions on how to make it more efficient if you post the snippet.
Re: Find duplicate lines from the file and write it into new file.
by jettero (Monsignor) on Jan 04, 2007 at 13:22 UTC

    That could be a hard problem, depending on what you need the filtered data for. My usual method is a $hash{$line}++ type approach to find dupes, but that's going to eat a lot of RAM (the problem you're having now, I guess) unless the lines have some identifier that you can use instead.

    One option might be to build a $digest = Digest::MD5->new() and count the digests. That could be really computationally expensive, especially for a file that big. I guess it depends what the lines look like. Would it be possible to include a few sample lines?
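
    A minimal sketch of that digest-counting idea (file names are made up; a 16-byte MD5 digest replaces each full line as the hash key, so an MD5 collision between different lines would give a false positive):

    #!/usr/bin/perl
    # Sketch: keep 16-byte MD5 digests as hash keys instead of the full lines.
    use strict;
    use warnings;
    use Digest::MD5 qw(md5);

    my $infile  = 'huge.txt';    # assumed input name
    my $dupfile = 'dups.txt';    # assumed output name

    my %count;
    open my $in,  '<', $infile  or die "Can't read $infile: $!";
    open my $out, '>', $dupfile or die "Can't write $dupfile: $!";

    while (my $line = <$in>) {
        chomp $line;
        # print a line the first time its digest repeats; different lines
        # that collide on MD5 would be reported as duplicates too
        print {$out} "$line\n" if ++$count{ md5($line) } == 2;
    }
    close $out;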

    -Paul

      Hashing might help, but you have to consider the risk of collisions. If records are all the same size, then seeking for comparisons can be done in linear time, but the hit for actually doing the comparisons will be a function of how many collisions there are -- and the smaller your hashspace, the more of those there will be.

      If you don't know anything about a typical record, I think the mergesort approach is a good shot.
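
      As an illustration of the fixed-size-record case mentioned above (a sketch; the 80-byte record length and file name are made up), any suspected duplicate can be re-read with a single seek and compared directly:

      # Sketch: with fixed-width records, record N starts at byte N * $reclen,
      # so a candidate match can be re-read and compared without keeping it in memory.
      use strict;
      use warnings;

      my $reclen = 80;                              # assumed record size in bytes
      open my $fh, '<', 'huge.txt' or die $!;       # assumed file name

      sub record {
          my ($recno) = @_;
          seek $fh, $recno * $reclen, 0 or die "seek failed: $!";
          read($fh, my $buf, $reclen) == $reclen or die "short read";
          return $buf;
      }

      # compare two records without holding either in a hash
      print "records match\n" if record(42) eq record(123_456);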

        Yeah, without knowing anything about the input file I was really just brainstorming... In my experience a 4-gig file is probably a log file or something, and collisions may not matter very much -- it really depends on what the filtered output is intended for.

        -Paul

Re: Find duplicate lines from the file and write it into new file.
by Melly (Chaplain) on Jan 04, 2007 at 15:16 UTC

    The two obvious approaches (well, ignoring tied hashes, DBs, etc.) are both crippled - memory requirements will prevent you from reading in the whole file (as you've already found), while a line-by-line approach (read a line, check the rest of the file for a duplicate) would be crippling in terms of time... so why not try a hybrid - read in, say, 10,000 lines, check the rest of the file for duplicates, then read the next 10,000?

    The following code is not complete, or fully tested (e.g. EOF handling?).

    use strict;

    my $file = 'data.txt';
    my $thiscount;
    my $fullcount;
    my $max = 10;    # change this to, say, 10000
    my %lines;

    open(INPUT, $file);

    while (<INPUT>) {
        chomp;
        if (exists $lines{$_}) {
            print "duplicate line (on read):$_\n";
        }
        else {
            $lines{$_} = 1;
        }
        $thiscount++;
        $fullcount++;
        if ($thiscount >= $max) {
            my $checkcount = 0;
            open(CHECK, $file);
            while (<CHECK>) {
                $checkcount++;
                if ($checkcount > $fullcount) {
                    chomp;
                    if (exists $lines{$_}) {
                        print "duplicate line (on check):$_\n";
                    }
                }
            }
            undef %lines;
            $thiscount = 0;
        }
    }
    map{$a=1-$_/10;map{$d=$a;$e=$b=$_/20-2;map{($d,$e)=(2*$d*$e+$a,$e**2 -$d**2+$b);$c=$d**2+$e**2>4?$d=8:_}1..50;print$c}0..59;print$/}0..20
    Tom Melly, pm@tomandlu.co.uk
Re: Findout duplicate lines from the file and write it into new file.
by tinita (Parson) on Jan 04, 2007 at 13:26 UTC
    To check whether a line has been seen before, you need to remember, somehow, every line you have read.

    But first you could just post the code you tried; people don't want to post answers only to find that your code already did that.

    An untested attempt from me would be: perl -pi -e' $_ = "" if $seen{$_}++' file
    Of course the hash %seen will get quite big (unless you have a lot of duplicates).

    edit: oops, you want to print out the duplicates.
    perl -ne' print if $seen{$_}++ == 1; ' file > newfile

Re: Find duplicate lines from the file and write it into new file.
by gam3 (Curate) on Jan 04, 2007 at 15:53 UTC
    If the lines are long Digest::MD5 might be able to solve your problem.
    #!/usr/bin/perl
    use strict;
    use Digest::MD5;

    my $file = shift;
    my ($input, $check);
    open($input, $file) or die "Could not open $file: $!";
    open($check, $file) or die "Could not open $file: $!";

    my %hash;
    while (!eof($input)) {
        my $location = tell($input);
        my $line     = readline($input);
        chomp $line;
        my $digest = Digest::MD5::md5($line);
        # $digest = length($line);
        if (defined(my $ll = $hash{$digest})) {
            my $d = 0;
            for my $l (@$ll) {
                seek($check, $l, 0);
                my $checkl = readline($check);
                chomp $checkl;
                if ($checkl eq $line) {
                    print "DUP $line\n";
                    $d = 1;
                    last;
                }
            }
            if ($d == 0) {
                push(@{$hash{$digest}}, $location);
            }
        } else {
            push(@{$hash{$digest}}, $location);
        }
    }
    The seek is really overkill in this case, but it would be needed if you used a checksum in place of the Digest::MD5 method.

    Note: This will only save memory if the average line length is longer than 16 bytes.

    UPDATE: Changed code to correctly handle problem pointed out by Corion.

    Solution for wojtyk. This will let you have up to 256 passes. It does assume that there is a random distribution of the first byte of the Digest.

    #!/usr/bin/perl
    use strict;
    use Digest::MD5;

    my $file = shift;
    my ($input, $check);
    open($input, $file) or die "Could not open $file: $!";
    open($check, $file) or die "Could not open $file: $!";

    my %hash;
    my $passes = 2;
    for (my $pass = 0; $pass < $passes; $pass++) {
        while (!eof($input)) {
            my $location = tell($input);
            my $line     = readline($input);
            chomp $line;
            my $digest = Digest::MD5::md5($line);
            my $p      = ord($digest);
            if ($p % $passes != $pass) {
                next;
            }
            if (defined(my $ll = $hash{$digest})) {
                my $d = 0;
                for my $l (@$ll) {
                    seek($check, $l, 0);
                    my $checkl = readline($check);
                    chomp $checkl;
                    if ($checkl eq $line) {
                        print "DUP $line\n";
                        $d = 1;
                        last;
                    }
                }
                if ($d == 0) {
                    push(@{$hash{$digest}}, $location);
                }
            } else {
                push(@{$hash{$digest}}, $location);
            }
        }
        seek($input, 0, 0);
    }
    -- gam3
    A picture is worth a thousand words, but takes 200K.

      This will fail for "second" duplicates that map to the same MD5:

      Assume that two inputs i1 and i2 map to the same hash (m1), then your code will correctly discard all duplicate instances of i1 but incorrectly keep all instances of i2 for the following input:

      i1
      i2
      i2
      i2

      You will need to keep a list of the offsets for each MD5, at least for every unique line encountered.

      I think the Digest method is ideal if you have the memory for it. If 16 bytes per line still consumes too much memory, you could take a piecewise approach.

      Divide the available memory by the number of lines in the file (or, more accurately, divide by the average line length if you want to waste time on that computation).
      That should give you the approximate amount of memory you have to work with per line (roughly... you'll probably need to adjust for overhead).
      If that number is 16 bytes or greater, just use the digest method. If not, do multiple passes of piecewise duplicate checking, as sketched below.

      For instance, if it turns out you only have 6 bytes of memory available per line, then do one pass over the data where you treat only the first 6 characters of each line as the line (i.e., store the first 6 characters in the hash, and use them to check for duplicates against the first 6 characters of every other line).
      Use that to create a new file.
      That should produce a smaller file containing only the lines whose first 6 characters are duplicated.
      Since you now have a smaller (but still lossless) file to work with, you can run another sweep over the new file checking a larger number of characters.
      Repeat until accurate.
      It's ugly and disk-expensive, but if you really don't have the memory available, it may be the only way to accomplish the task.
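
      A sketch of one such prefix pass (the 6-byte prefix length and file names are made up; only lines whose prefix repeats can possibly be full duplicates, so only they survive into the next round):

      #!/usr/bin/perl
      # Sketch of one filtering pass: keep only the lines whose prefix occurs
      # more than once; those are the only candidates for full duplicates.
      use strict;
      use warnings;

      my $prefix_len = 6;                 # assumed per-line memory budget
      my $infile     = 'huge.txt';        # assumed input name
      my $outfile    = 'candidates.txt';  # assumed output for the next round

      # First sweep: count prefixes (short keys keep the hash small).
      my %prefix_count;
      open my $in, '<', $infile or die "Can't read $infile: $!";
      while (my $line = <$in>) {
          $prefix_count{ substr($line, 0, $prefix_len) }++;
      }

      # Second sweep: keep only the lines whose prefix repeated.
      seek $in, 0, 0;
      open my $out, '>', $outfile or die "Can't write $outfile: $!";
      while (my $line = <$in>) {
          print {$out} $line if $prefix_count{ substr($line, 0, $prefix_len) } > 1;
      }
      close $out;

      # Re-run on candidates.txt with a longer prefix (or the full lines)
      # until the surviving set is small enough to check exactly in memory.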

Re: Find duplicate lines from the file and write it into new file.
by shigetsu (Hermit) on Jan 04, 2007 at 13:26 UTC
    #!/usr/bin/perl
    use strict;
    use warnings;

    my %duplicates;

    while (<>) {
        chomp;
        $duplicates{$_}++;
    }

    foreach my $key (keys %duplicates) {
        if ($duplicates{$key} > 1) {
            delete $duplicates{$key};
            print "$key\n";
        }
    }
    Expects to be run as script < inputfile > outputfile
      Slightly more efficient....

      #!/usr/bin/perl -w
      use strict;
      use warnings;

      my %duplicates;

      while (<DATA>) {
          print if !defined $duplicates{$_};
          $duplicates{$_}++;
      }

      __DATA__
      The quick red fox jumped over the lazy brown dog.
      Time flies like an arrow, fruit flies like a banana.
      Time flies like an arrow, fruit flies like a banana.
      Now is the time for all good men to come to the aid of their party.
      The quick red fox jumped over the lazy brown dog.
      The quick red fox jumped over the lazy brown dog.
      Time flies like an arrow, fruit flies like a banana.
      Now is the time for all good men to come to the aid of their party.
      Now is the time for all good men to come to the aid of their party.

      --roboticus

        I used the script to find duplicate rows in my log file and it worked well. Thanks! However, it would be great if this script could be updated to also display how many times each row was repeated. Please help!
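
        A small variation on the script above that also prints how many times each repeated line occurred (a sketch; it reads from STDIN or from a file name given on the command line):

        #!/usr/bin/perl
        # Sketch: count every line, then report each duplicate with its count.
        use strict;
        use warnings;

        my %count;
        while (my $line = <>) {
            chomp $line;
            $count{$line}++;
        }

        for my $line (sort keys %count) {
            next unless $count{$line} > 1;
            printf "%6d  %s\n", $count{$line}, $line;
        }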
