Hello, Manju Moorthy, and welcome to the Monastery.
You could do this in Perl, but is there any need when
$ sort file | uniq -c
would seem to be all you require? Just for fun, here's a Perl equivalent:
$ perl -ne '$x{$_}++;END {printf ("%5i %s", $x{$_}, $_) for keys %x};' file
If I remember right, DNA sequence files are often very large. You could trade CPU time for memory by keying your hash on a digest of each line instead of the line data itself. If a typical sequence file is 60 characters per line, that's 480 bits, so a 128-bit MD5 digest would use significantly less memory per key, at the cost of some extra CPU.
use warnings;
use strict;
use Digest::MD5 qw(md5);

my %SEEN;
while (<>) {
    chomp;
    my $digest = md5($_);
    if ($SEEN{$digest}++) {
        printf STDOUT "Dup: [%s] seen %d times\n", $_, $SEEN{$digest};
    }
}
60 characters per line, that's 480 bits
Why 60 x 8 = 480 bits, when one character = [ATGC] = 2 bits?
Well, yes, a content-aware solution could mush the data down to 2 bits per character. I was proposing, though, something more memory efficient than $SEEN{$_}++.
I don't know much about DNA, but googling around a bit, "one DNA sequence per line" could mean 237, 373, etc. characters per line. 373 x 2 = 746 bits, so a 128-bit MD5 hash could still be significantly smaller.
Also, I don't know if OP's file format has comments or other things besides A/T/G/C.
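For what it's worth, here's a minimal sketch of the 2-bits-per-base idea. pack_seq is a hypothetical helper of my own, and it assumes every line contains only A/C/G/T, which, as noted, may not hold for real files.

```perl
use strict;
use warnings;

# Map each base to a 2-bit code. This is an illustrative choice,
# not a standard encoding.
my %CODE = ( A => 0, C => 1, G => 2, T => 3 );

# Pack an A/C/G/T string into 2 bits per base.
# Dies on any other character (comments, N's, etc.).
sub pack_seq {
    my ($seq) = @_;
    my $bits = '';
    for my $base ( split //, uc $seq ) {
        die "unexpected character '$base'\n" unless exists $CODE{$base};
        $bits .= sprintf '%02b', $CODE{$base};
    }
    # pack 'B*' zero-pads to a byte boundary, so a 373-base line
    # becomes 94 bytes instead of 373.
    return pack 'B*', $bits;
}

# Usage, same shape as the MD5 version above:
# my %SEEN;
# while (<>) { chomp; print "Dup: $_\n" if $SEEN{ pack_seq($_) }++; }
```

Unlike a digest, this encoding is lossless and collision-free, but it only wins over MD5 when lines are shorter than 64 bases (128 bits' worth).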