PerlMonks
Re^3: Comparing each line of a file to itself
by kschwab (Vicar) on Jan 13, 2019 at 20:59 UTC ( [id://1228491]=note )
Well, yes, a content-aware solution could squeeze each character down to 2 bits. I was proposing, though, something more memory efficient than $SEEN{$_}++. I don't know much about DNA, but googling around a bit, "one DNA sequence per line" could mean 237, 373, etc., characters per line. Even at 2 bits per character, 373*2 = 746 bits (about 94 bytes), so a 128-bit (16-byte) MD5 digest could still be significantly smaller. Also, I don't know if OP's file format has comments or other things besides A/T/G/C.
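For illustration, here is a minimal sketch of the digest-as-key idea, assuming the core Digest::MD5 module and plain ASCII lines. The sub name find_duplicate_lines is made up for this example and is not from the OP's code. It keys %seen on the 16 raw bytes returned by md5() instead of on the full line:

```perl
use strict;
use warnings;
use Digest::MD5 qw(md5);

# Hypothetical helper: returns each line that occurs more than once,
# reported a single time, in the order the second occurrence is seen.
sub find_duplicate_lines {
    my @lines = @_;
    my %seen;    # key: 16-byte raw MD5 digest, value: occurrence count
    my @dups;
    for my $line (@lines) {
        chomp( my $copy = $line );
        my $key = md5($copy);    # raw binary digest, always 16 bytes
        # Post-increment returns the old count (0 on first sight),
        # so this is true only on the second occurrence.
        push @dups, $copy if $seen{$key}++ == 1;
    }
    return @dups;
}

print "$_\n" for find_duplicate_lines( "ACGT\n", "TTGA\n", "ACGT\n" );
```

Two caveats: distinct lines with the same MD5 would be reported as duplicates (very unlikely by accident, but possible), and Perl's own per-key hash overhead means the real saving is less than the raw 94-versus-16-byte comparison suggests.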