Hello, Manju Moorthy, and welcome to the Monastery.
You could do this in Perl, but is there any need when
$ sort file | uniq -c
would seem to be all you require? Just for fun, here's a Perl equivalent:
$ perl -ne '$x{$_}++;END {printf ("%5i %s", $x{$_}, $_) for keys %x};' file
If I remember right, DNA sequence files are often very large. You could trade CPU time for memory by keying your hash on a digest of each line instead of the line data itself. If a typical sequence file is 60 characters per line, that's 480 bits, so a 128-bit MD5 digest would use significantly less memory per key, at the cost of some extra CPU.
use warnings;
use strict;
use Digest::MD5 qw(md5);

my %SEEN;
while (<>) {
    chomp;
    my $digest = md5($_);
    if ($SEEN{$digest}++) {
        printf STDOUT "Dup: [%s] seen %d times\n", $_, $SEEN{$digest};
    }
}
60 characters per line, that's 480 bits
Why 60 x 8 = 480 bits, when one character = [ATGC] = 2 bits?
Well, yes, a content-aware solution could mush the data down to 2 bits per character. I was proposing, though, something more memory efficient than $SEEN{$_}++.
I don't know much about DNA, but googling around a bit, "one DNA sequence per line" could mean 237, 373, etc. characters per line. 373 x 2 = 746 bits, so a 128-bit MD5 hash could still be significantly smaller.
Also, I don't know if OP's file format has comments or other things besides A/T/G/C.
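For what it's worth, here's a minimal sketch of the 2-bits-per-base idea. pack_seq is a hypothetical helper of my own, and it assumes every line contains only A/C/G/T, which, as noted, may not hold for real files.

```perl
use strict;
use warnings;

# Map each base to a 2-bit code. This is an illustrative choice,
# not a standard encoding.
my %CODE = ( A => 0, C => 1, G => 2, T => 3 );

# Pack an A/C/G/T string into 2 bits per base.
# Dies on any other character (comments, N's, etc.).
sub pack_seq {
    my ($seq) = @_;
    my $bits = '';
    for my $base ( split //, uc $seq ) {
        die "unexpected character '$base'\n" unless exists $CODE{$base};
        $bits .= sprintf '%02b', $CODE{$base};
    }
    # pack 'B*' zero-pads to a byte boundary, so a 373-base line
    # becomes 94 bytes instead of 373.
    return pack 'B*', $bits;
}

# Usage, same shape as the MD5 version above:
# my %SEEN;
# while (<>) { chomp; print "Dup: $_\n" if $SEEN{ pack_seq($_) }++; }
```

Unlike a digest, this encoding is lossless and collision-free, but it only wins over MD5 when lines are shorter than 64 bases (128 bits' worth).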