Comparing each line of a file to itself

Manju Moorthy has asked for the wisdom of the Perl Monks concerning the following question:

I have a file in which each line holds a DNA sequence. Using Perl, how can I compare each line with the entire file to count how many times each line is repeated?

Each line of the file contains exactly one DNA sequence.


Replies are listed 'Best First'.
Re: Comparing each line of a file to itself by hippo (Bishop) on Jan 13, 2019 at 13:02 UTC
Hello, Manju Moorthy, and welcome to the Monastery. You could do this in Perl, but is there any need when `$ sort file | uniq -c` would seem to be all you require? Just for fun, here's a Perl equivalent: `$ perl -ne '$x{$_}++;END {printf ("%5i %s", $x{$_}, $_) for keys %x};' file`
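For readers who prefer a script to a one-liner, the same counting-hash idea might be sketched as below. The sub name `count_lines`, the in-memory sample data and the choice to skip blank lines are assumptions made for illustration; they are not part of the reply above.

```perl
use strict;
use warnings;

# Count how often each line occurs on a filehandle.
# Returns a hash reference mapping line => number of occurrences.
sub count_lines {
    my ($fh) = @_;
    my %count;
    while (my $line = <$fh>) {
        chomp $line;
        next if $line eq '';    # assumption: blank lines are not sequences
        $count{$line}++;
    }
    return \%count;
}

# Made-up sample data, read through an in-memory filehandle.
my $sample = "ATGC\nGGCA\nATGC\nTTAA\nATGC\n";
open my $fh, '<', \$sample or die "Cannot open in-memory file: $!";
my $count = count_lines($fh);
close $fh;

# Most frequent sequences first, ties broken alphabetically.
for my $seq (sort { $count->{$b} <=> $count->{$a} || $a cmp $b } keys %$count) {
    printf "%5d %s\n", $count->{$seq}, $seq;
}
```

To run it on a real file, open that file instead of `\$sample`. Unlike `sort file | uniq -c`, this keeps every distinct line in memory at once, which is the cost the later replies discuss.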
Re: Comparing each line of a file to itself by kschwab (Vicar) on Jan 13, 2019 at 13:59 UTC
If I remember right, DNA sequence files are often very large. You could trade CPU for memory by comparing hashes of each line instead of the line data itself. If a typical sequence file has 60 characters per line, that's 480 bits per line, so a 128-bit MD5 digest would use significantly less memory, at the cost of more CPU. `use warnings; use strict; use Digest::MD5 qw(md5); my %SEEN; while (<>) { chomp; my $digest=md5($_); if ($SEEN{$digest}++) { printf STDOUT "Dup: [%s] seen %d times\n",$_,$SEEN{$digest}; } }`
Re^2: Comparing each line of a file to itself by bliako (Monsignor) on Jan 13, 2019 at 20:27 UTC
"60 characters per line, that's 480 bits": why 60 x 8 = 480 bits, when 1 character = `[ATGC]` = 2 bits?
Re^3: Comparing each line of a file to itself by kschwab (Vicar) on Jan 13, 2019 at 20:59 UTC
Well, yes, a content-aware solution could mush each character down to 2 bits. I was proposing, though, something more memory efficient than `$SEEN{$_}++`. I don't know much about DNA, but googling around a bit, "one DNA sequence per line" could mean 237, 373, etc., characters per line. 373 * 2 = 746 bits, so a 128-bit MD5 hash could still be significantly smaller. Also, I don't know if OP's file format has comments or other things besides A/T/G/C.
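The 2-bits-per-character idea raised in this sub-thread could look roughly like the sketch below. The sub name `pack_dna`, the bit assignment for each base and the length prefix are assumptions for illustration; the sketch also assumes upper-case A/C/G/T only and treats any other character (lower case, N, comments) as unpackable.

```perl
use strict;
use warnings;

# Assumed 2-bit code for each base.
my %CODE = (A => 0, C => 1, G => 2, T => 3);

# Pack a string of A/C/G/T into 2 bits per base.
# Returns undef when the string contains any other character.
sub pack_dna {
    my ($seq) = @_;
    return if $seq =~ /[^ACGT]/;
    my $bits = '';
    my $i    = 0;
    vec($bits, $i++, 2) = $CODE{$_} for split //, $seq;
    # Prefix the length: otherwise "A" and "AA" would both pack to one zero byte.
    return pack('N', length($seq)) . $bits;
}

# Count made-up sequences by their packed key instead of the raw line.
my %seen;
for my $seq (qw(ATGC GGCA ATGC A AA)) {
    my $key = pack_dna($seq);
    next unless defined $key;    # this sketch simply skips unpackable lines
    $seen{$key}++;
}
printf "%d distinct sequences\n", scalar keys %seen;

# A 60-base line needs 15 bytes of packed data plus the 4-byte length prefix.
my $long = 'ACGT' x 15;
printf "%d bases -> %d-byte key\n", length($long), length(pack_dna($long));
```

Unlike an MD5 digest, the packed key is lossless, so two different sequences can never share a key. For lines much longer than 64 bases, though, the 16-byte digest is still the smaller key, which is the point made above.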
Re: Comparing each line of a file to itself by LanX (Saint) on Jan 13, 2019 at 14:53 UTC
I'd say you loop over the lines and count in a hash. Please show us some sample data and what you tried. Cheers, Rolf (addicted to the Perl Programming Language :)
