Re: Comparing each line of a file to itself



by kschwab (Vicar)

on Jan 13, 2019 at 13:59 UTC


in reply to Comparing each line of a file to itself

If I remember right, DNA sequence files are often very large. You could trade CPU for memory by storing a digest of each line instead of the line data itself. If a typical sequence file has 60 characters per line, that's 480 bits per line, so a 128-bit MD5 digest would use significantly less memory, at the cost of more CPU.

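use warnings;
use strict;
use Digest::MD5 qw(md5);
my %SEEN;   # raw 16-byte MD5 digest => number of times that line was seen
while (<>) {
    chomp;
    my $digest = md5($_);
    # Post-increment: false the first time a line is seen, true for every repeat.
    if ($SEEN{$digest}++) {
        printf STDOUT "Dup: [%s] seen %d times\n", $_, $SEEN{$digest};
    }
}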
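
To make the size trade-off concrete, here is a small sketch comparing key lengths. The 60-character line is a made-up example, not taken from OP's data. Treat the numbers as a rough guide only: Perl adds its own overhead to every hash entry, so the real saving will be smaller than the raw ratio suggests.

```perl
use strict;
use warnings;
use Digest::MD5 qw(md5 md5_hex);

# Hypothetical 60-character sequence line, for illustration only.
my $line = 'ACGT' x 15;

printf "raw line: %d bytes\n", length $line;            # 60
printf "md5:      %d bytes\n", length md5($line);       # 16 (binary digest)
printf "md5_hex:  %d bytes\n", length md5_hex($line);   # 32 (hex string)
```

The binary form returned by md5() is half the size of md5_hex(), which is why the code above uses it as the hash key.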

Re^2: Comparing each line of a file to itself by bliako (Monsignor) on Jan 13, 2019 at 20:27 UTC
"60 characters per line, that's 480 bits"

Why 60x8 = 480 bits, when 1 character = `[ATGC]` = 2 bits?
Re^3: Comparing each line of a file to itself by kschwab (Vicar) on Jan 13, 2019 at 20:59 UTC
Well, yes, a content-aware solution could mush each character down to 2 bits. I was proposing, though, something more memory efficient than $SEEN{$_}++. I don't know much about DNA, but from googling around a bit, "one DNA sequence per line" could mean 237, 373, etc. characters per line. 373*2 = 746 bits, so a 128-bit MD5 digest could still be significantly smaller. Also, I don't know if OP's file format has comments or other things besides A/T/G/C.
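
For comparison, the 2-bits-per-character idea could be sketched with Perl's built-in vec(). This is only an illustration under assumptions: the pack_dna helper and its A/C/G/T-to-number mapping are made up for this example, and it assumes lines are uppercase and contain nothing but A/C/G/T.

```perl
use strict;
use warnings;

# Hypothetical 2-bit encoding; the mapping is an arbitrary choice for this sketch.
my %CODE = (A => 0, C => 1, G => 2, T => 3);

# Pack a pure A/C/G/T string into 2 bits per base.
# Returns undef if the line contains anything else.
sub pack_dna {
    my ($seq) = @_;
    my $packed = '';
    my $i = 0;
    for my $base (split //, $seq) {
        return undef unless exists $CODE{$base};
        vec($packed, $i++, 2) = $CODE{$base};
    }
    # Prefix the base count so that "ACG" and "ACGA" (A packs to zero bits)
    # do not collapse to the same key.
    return pack('N', $i) . $packed;
}

my %seen;
for my $line (qw(ACGT ACGA ACG ACGT)) {   # stand-in for the lines of a file
    my $key = pack_dna($line);
    next unless defined $key;             # non-ACGT lines would need separate handling
    print "Dup: [$line]\n" if $seen{$key}++;
}
```

With this layout a 60-base line packs to 19 bytes (4 for the length plus 15 of data) and a 373-base line to 98 bytes, so the 16-byte MD5 key is still the smaller of the two. The packed key, on the other hand, is exact and cannot collide.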
