note
kschwab
If I remember right, DNA sequence files are often very large. You could trade CPU for memory by comparing a hash of each line instead of the line data itself. If a typical sequence file has 60 characters per line, that's 480 bits per line, so a 128-bit (16-byte) MD5 digest would use significantly less memory per stored key, at the cost of more CPU.
<code>
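use warnings;
use strict;
use Digest::MD5 qw(md5);

# Key the hash on each line's 16-byte binary MD5 digest, not the line itself.
my %SEEN;
while (<>) {
    chomp;
    my $digest = md5($_);
    # Post-increment is true from the second occurrence of a digest onward.
    if ($SEEN{$digest}++) {
        printf STDOUT "Dup: [%s] seen %d times\n", $_, $SEEN{$digest};
    }
}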
</code>
1228463
1228463