Find duplicate lines from the file and write it into new file.

by anna_here (Initiate)
on Jan 04, 2007 at 13:13 UTC ( [id://592934]=perlquestion )

anna_here has asked for the wisdom of the Perl Monks concerning the following question:

I have a huge file (about 4 GB). I want to find the duplicate lines in that file and write them into a new file. I wrote some code, but I get an 'Out of memory' error.

Replies are listed 'Best First'.
Re: Find duplicate lines from the file and write it into new file.
by Corion (Patriarch) on Jan 04, 2007 at 13:26 UTC

    The traditional approach is to sort your file first; then all duplicates will appear next to each other. To sort a huge amount of data, the best approach is to divide it into small parts, sort those first, and then do a merge sort over the sorted parts. During that merge you can also output the duplicates easily.
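
    A minimal sketch of that chunk-and-merge idea (the file names and chunk size here are made up for illustration; duplicates are written out during the merge, each distinct duplicate once):

    #!/usr/bin/perl
    # Sketch only: split the input into sorted chunks on disk, then merge
    # the chunks and emit every line that appears more than once.
    use strict;
    use warnings;
    use File::Temp qw(tempfile);

    my $infile     = 'huge.txt';   # assumed input name
    my $dupfile    = 'dups.txt';   # assumed output name
    my $chunk_size = 1_000_000;    # lines per chunk; tune to available memory

    # Phase 1: write sorted chunks to temporary files.
    my @chunks;
    open my $in, '<', $infile or die "Can't read $infile: $!";
    while (!eof($in)) {
        my @lines;
        while (@lines < $chunk_size and defined(my $l = <$in>)) {
            push @lines, $l;
        }
        my ($fh) = tempfile(UNLINK => 1);
        print {$fh} sort @lines;
        seek $fh, 0, 0;
        push @chunks, $fh;
    }
    close $in;

    # Phase 2: k-way merge; identical lines now arrive consecutively.
    open my $out, '>', $dupfile or die "Can't write $dupfile: $!";
    my @head = map { scalar readline($_) } @chunks;   # current line of each chunk
    my ($prev, $reported) = (undef, 0);
    while (grep { defined } @head) {
        my $min;                                      # index of the smallest current line
        for my $i (0 .. $#head) {
            next unless defined $head[$i];
            $min = $i if !defined $min or $head[$i] lt $head[$min];
        }
        my $line = $head[$min];
        if (defined $prev and $line eq $prev) {
            print {$out} $line unless $reported;      # report each duplicate once
            $reported = 1;
        }
        else {
            ($prev, $reported) = ($line, 0);
        }
        $head[$min] = readline($chunks[$min]);        # advance that chunk
    }
    close $out;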

      If sorting is an option, and we are on Unix/Linux here, I would try out sort -u file

      update: yep, i misread the question. forget it =)

        but then you would lose the duplicates; I think the OP wanted to find the duplicates. Perhaps something along the lines of sort | uniq -c, then they could read the file and look at the counts



        This is not a Signature...
Re: Find duplicate lines from the file and write it into new file.
by pajout (Curate) on Jan 04, 2007 at 13:20 UTC
    Try to use BerkeleyDB, which allows you to transparently manipulate a hash stored in a file.
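
    For example, a minimal sketch with the core DB_File module (a tied on-disk hash; BerkeleyDB.pm works along the same lines, and the file names here are made up):

    #!/usr/bin/perl
    # Sketch: keep the seen-line counts in an on-disk hash instead of RAM.
    use strict;
    use warnings;
    use DB_File;
    use Fcntl;

    my $infile  = 'huge.txt';   # assumed input name
    my $dupfile = 'dups.txt';   # assumed output name

    tie my %seen, 'DB_File', 'seen.db', O_RDWR | O_CREAT, 0666, $DB_BTREE
        or die "Cannot tie seen.db: $!";

    open my $in,  '<', $infile  or die "Can't read $infile: $!";
    open my $out, '>', $dupfile or die "Can't write $dupfile: $!";

    while (my $line = <$in>) {
        chomp $line;
        # print each duplicated line the first time it repeats
        print {$out} "$line\n" if ++$seen{$line} == 2;
    }

    close $out;
    untie %seen;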
Re: Find duplicate lines from the file and write it into new file.
by davidrw (Prior) on Jan 04, 2007 at 13:54 UTC
    Is the file sorted? If so, a simple uniq -d foo.txt > dups.txt will do it.
    I wrote one coding but I have 'Out of memory Error'
    What's the code look like? Someone here may be able to provide suggestions on how to make it more efficient if you post the snippet.
Re: Find duplicate lines from the file and write it into new file.
by jettero (Monsignor) on Jan 04, 2007 at 13:22 UTC

    That could be a hard problem, depending on what you need the filtered data for. My usual method is a $hash{$line}++ type approach to find dupes, but that's going to eat a lot of RAM (the problem you're having now, I guess) unless the lines have some identifier that you can use instead.

    One option might be to build a $digest = Digest::MD5->new() and count the digests. That could be really computationally expensive, especially for a file that big. I guess it depends what the lines look like. Would it be possible to include a few sample lines?
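
    A minimal sketch of that digest-counting idea (file names are made up; a 16-byte MD5 digest replaces each full line as the hash key, so an MD5 collision between different lines would give a false positive):

    #!/usr/bin/perl
    # Sketch: keep 16-byte MD5 digests as hash keys instead of the full lines.
    use strict;
    use warnings;
    use Digest::MD5 qw(md5);

    my $infile  = 'huge.txt';    # assumed input name
    my $dupfile = 'dups.txt';    # assumed output name

    my %count;
    open my $in,  '<', $infile  or die "Can't read $infile: $!";
    open my $out, '>', $dupfile or die "Can't write $dupfile: $!";

    while (my $line = <$in>) {
        chomp $line;
        # print a line the first time its digest repeats; different lines
        # that collide on MD5 would be reported as duplicates too
        print {$out} "$line\n" if ++$count{ md5($line) } == 2;
    }
    close $out;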

    -Paul

      Hashing might help, but you have to consider the risk of collisions. If records are all the same size, then seeking for comparisons can be done in linear time, but the hit for actually doing the comparisons will be a function of how many collisions there are -- and the smaller your hashspace, the more of those there will be.

      If you don't know anything about a typical record, I think the mergesort approach is a good shot.
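
      As an illustration of the fixed-size-record case mentioned above (a sketch; the 80-byte record length and file name are made up), any suspected duplicate can be re-read with a single seek and compared directly:

      # Sketch: with fixed-width records, record N starts at byte N * $reclen,
      # so a candidate match can be re-read and compared without keeping it in memory.
      use strict;
      use warnings;

      my $reclen = 80;                              # assumed record size in bytes
      open my $fh, '<', 'huge.txt' or die $!;       # assumed file name

      sub record {
          my ($recno) = @_;
          seek $fh, $recno * $reclen, 0 or die "seek failed: $!";
          read($fh, my $buf, $reclen) == $reclen or die "short read";
          return $buf;
      }

      # compare two records without holding either in a hash
      print "records match\n" if record(42) eq record(123_456);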

        Yeah, without knowing anything about the input file I was really just brainstorming... In my experience a 4-gig file is probably a log file or something, and collisions may not matter very much -- it really depends on what the filtered output is intended for.

        -Paul

Re: Find duplicate lines from the file and write it into new file.
by Melly (Chaplain) on Jan 04, 2007 at 15:16 UTC

    The two obvious approaches (well, ignoring tied hashes, DBs, etc.) are both crippled - memory requirements will prevent you from reading in the whole file (as you've already found), while a line-by-line approach (read a line, check the rest of the file for a duplicate) would be crippling in terms of time... so why not try a hybrid - read in, say, 10,000 lines, check the rest of the file for duplicates, then read the next 10,000?

    The following code is not complete, or fully tested (e.g. EOF handling?).

    use strict;

    my $file = 'data.txt';
    my $thiscount;
    my $fullcount;
    my $max = 10;    # change this to, say, 10000
    my %lines;

    open(INPUT, $file);

    while (<INPUT>) {
        chomp;
        if (exists $lines{$_}) {
            print "duplicate line (on read):$_\n";
        }
        else {
            $lines{$_} = 1;
        }
        $thiscount++;
        $fullcount++;
        if ($thiscount >= $max) {
            my $checkcount = 0;
            open(CHECK, $file);
            while (<CHECK>) {
                $checkcount++;
                if ($checkcount > $fullcount) {
                    chomp;
                    if (exists $lines{$_}) {
                        print "duplicate line (on check):$_\n";
                    }
                }
            }
            undef %lines;
            $thiscount = 0;
        }
    }
    map{$a=1-$_/10;map{$d=$a;$e=$b=$_/20-2;map{($d,$e)=(2*$d*$e+$a,$e**2 -$d**2+$b);$c=$d**2+$e**2>4?$d=8:_}1..50;print$c}0..59;print$/}0..20
    Tom Melly, pm@tomandlu.co.uk
Re: Findout duplicate lines from the file and write it into new file.
by tinita (Parson) on Jan 04, 2007 at 13:26 UTC
    To check whether a line has been seen before, you need to remember, somehow, every line you have read.

    But first you could just post the code you tried; people don't want to post answers only to find that your code already did that.

    An untested attempt from me would be: perl -pi -e' $_ = "" if $seen{$_}++' file
    Of course the hash %seen will get quite big (unless you have a lot of duplicates).

    edit: oops, you want to print out the duplicates.
    perl -ne' print if $seen{$_}++ == 1; ' file > newfile

Re: Find duplicate lines from the file and write it into new file.
by gam3 (Curate) on Jan 04, 2007 at 15:53 UTC
    If the lines are long Digest::MD5 might be able to solve your problem.
    #!/usr/bin/perl
    use strict;
    use Digest::MD5;

    my $file = shift;
    my ($input, $check);
    open($input, $file) or die "Could not open $file: $!";
    open($check, $file) or die "Could not open $file: $!";

    my %hash;
    while (!eof($input)) {
        my $location = tell($input);
        my $line     = readline($input);
        chomp $line;
        my $digest = Digest::MD5::md5($line);
        # $digest = length($line);
        if (defined(my $ll = $hash{$digest})) {
            my $d = 0;
            for my $l (@$ll) {
                seek($check, $l, 0);
                my $checkl = readline($check);
                chomp $checkl;
                if ($checkl eq $line) {
                    print "DUP $line\n";
                    $d = 1;
                    last;
                }
            }
            if ($d == 0) {
                push(@{$hash{$digest}}, $location);
            }
        } else {
            push(@{$hash{$digest}}, $location);
        }
    }
    The seek is really overkill in this case, but it would be needed if you used a checksum in place of the Digest::MD5 method.

    Note: This will only save memory if the average line length is longer than 16 bytes.

    UPDATE: Changed code to correctly handle problem pointed out by Corion.

    Solution for wojtyk. This will let you have up to 256 passes. It does assume that there is a random distribution of the first byte of the Digest.

    #!/usr/bin/perl
    use strict;
    use Digest::MD5;

    my $file = shift;
    my ($input, $check);
    open($input, $file) or die "Could not open $file: $!";
    open($check, $file) or die "Could not open $file: $!";

    my %hash;
    my $passes = 2;
    for (my $pass = 0; $pass < $passes; $pass++) {
        while (!eof($input)) {
            my $location = tell($input);
            my $line     = readline($input);
            chomp $line;
            my $digest = Digest::MD5::md5($line);
            my $p      = ord($digest);
            if ($p % $passes != $pass) {
                next;
            }
            if (defined(my $ll = $hash{$digest})) {
                my $d = 0;
                for my $l (@$ll) {
                    seek($check, $l, 0);
                    my $checkl = readline($check);
                    chomp $checkl;
                    if ($checkl eq $line) {
                        print "DUP $line\n";
                        $d = 1;
                        last;
                    }
                }
                if ($d == 0) {
                    push(@{$hash{$digest}}, $location);
                }
            } else {
                push(@{$hash{$digest}}, $location);
            }
        }
        seek($input, 0, 0);
    }
    -- gam3
    A picture is worth a thousand words, but takes 200K.

      This will fail for "second" duplicates that map to the same MD5:

      Assume that two inputs i1 and i2 map to the same hash (m1), then your code will correctly discard all duplicate instances of i1 but incorrectly keep all instances of i2 for the following input:

      i1
      i2
      i2
      i2

      You will need to keep a list of the offsets for each MD5, at least for every unique line encountered.

      I think the Digest method is ideal if you have the memory for it. If 16 bytes per line still consumes too much memory, you could take a piecewise approach.

      Divide the available memory by the number of lines in the file (or, more accurately, divide by the average line length if you want to waste time on that computation).
      That should give you the approximate amount of memory you have to work with per line (roughly... you'll probably need to adjust for overhead).
      If that number is 16 bytes or greater, just use the digest method. If not, do multiple passes of piecewise duplicate checking, as sketched below.

      For instance, if it turns out you only have 6 bytes of memory available per line, then do one pass over the data where you treat only the first 6 characters of each line as the line (i.e., store the first 6 characters in the hash, and use them to check for duplicates against the first 6 characters of every other line).
      Use that to create a new file.
      That should produce a smaller file containing only the lines whose first 6 characters are duplicated.
      Since you now have a smaller (but still lossless) file to work with, you can run another sweep over the new file checking a larger number of characters.
      Repeat until accurate.
      It's ugly and disk-expensive, but if you really don't have the memory available, it may be the only way to accomplish the task.
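
      A sketch of one such prefix pass (the 6-byte prefix length and file names are made up; only lines whose prefix repeats can possibly be full duplicates, so only they survive into the next round):

      #!/usr/bin/perl
      # Sketch of one filtering pass: keep only the lines whose prefix occurs
      # more than once; those are the only candidates for full duplicates.
      use strict;
      use warnings;

      my $prefix_len = 6;                 # assumed per-line memory budget
      my $infile     = 'huge.txt';        # assumed input name
      my $outfile    = 'candidates.txt';  # assumed output for the next round

      # First sweep: count prefixes (short keys keep the hash small).
      my %prefix_count;
      open my $in, '<', $infile or die "Can't read $infile: $!";
      while (my $line = <$in>) {
          $prefix_count{ substr($line, 0, $prefix_len) }++;
      }

      # Second sweep: keep only the lines whose prefix repeated.
      seek $in, 0, 0;
      open my $out, '>', $outfile or die "Can't write $outfile: $!";
      while (my $line = <$in>) {
          print {$out} $line if $prefix_count{ substr($line, 0, $prefix_len) } > 1;
      }
      close $out;

      # Re-run on candidates.txt with a longer prefix (or the full lines)
      # until the surviving set is small enough to check exactly in memory.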

Re: Find duplicate lines from the file and write it into new file.
by shigetsu (Hermit) on Jan 04, 2007 at 13:26 UTC
    #!/usr/bin/perl
    use strict;
    use warnings;

    my %duplicates;

    while (<>) {
        chomp;
        $duplicates{$_}++;
    }

    foreach my $key (keys %duplicates) {
        if ($duplicates{$key} > 1) {
            delete $duplicates{$key};
            print "$key\n";
        }
    }
    Expects to be run as script < inputfile > outputfile
      Slightly more efficient....

      #!/usr/bin/perl -w
      use strict;
      use warnings;

      my %duplicates;

      while (<DATA>) {
          print if !defined $duplicates{$_};
          $duplicates{$_}++;
      }

      __DATA__
      The quick red fox jumped over the lazy brown dog.
      Time flies like an arrow, fruit flies like a banana.
      Time flies like an arrow, fruit flies like a banana.
      Now is the time for all good men to come to the aid of their party.
      The quick red fox jumped over the lazy brown dog.
      The quick red fox jumped over the lazy brown dog.
      Time flies like an arrow, fruit flies like a banana.
      Now is the time for all good men to come to the aid of their party.
      Now is the time for all good men to come to the aid of their party.

      --roboticus

        I used the script to find duplicate rows in my log file and it worked well. Thanks! However, it would be great if this script could be updated to also display how many times each row was repeated. Please help!
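
        A small variation on the script above that also prints how many times each repeated line occurred (a sketch; it reads from STDIN or from a file name given on the command line):

        #!/usr/bin/perl
        # Sketch: count every line, then report each duplicate with its count.
        use strict;
        use warnings;

        my %count;
        while (my $line = <>) {
            chomp $line;
            $count{$line}++;
        }

        for my $line (sort keys %count) {
            next unless $count{$line} > 1;
            printf "%6d  %s\n", $count{$line}, $line;
        }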
