PerlMonks  

Comparing each line of a file to itself

by Manju Moorthy (Initiate)
on Jan 13, 2019 at 12:42 UTC ( #1228463=perlquestion )

Manju Moorthy has asked for the wisdom of the Perl Monks concerning the following question:

I have a file in which each line contains exactly one DNA sequence. How can I compare each line against the entire file, in Perl, to get the number of times each line is repeated?


Replies are listed 'Best First'.
Re: Comparing each line of a file to itself
by hippo (Bishop) on Jan 13, 2019 at 13:02 UTC

    Hello, Manju Moorthy, and welcome to the Monastery.

    You could do this in Perl but is there any need when

    $ sort file | uniq -c

    would seem to be all you require? Just for fun, here's a Perl equivalent:

    $ perl -ne '$x{$_}++;END {printf ("%5i %s", $x{$_}, $_) for keys %x};' file
Re: Comparing each line of a file to itself
by kschwab (Vicar) on Jan 13, 2019 at 13:59 UTC
    If I remember right, DNA sequence files are often very large. You could trade CPU for memory by keying on a digest of each line instead of the line data itself. If a typical sequence file has 60 characters per line, that's 480 bits per line, so a 128-bit MD5 digest would use significantly less memory, at the cost of more CPU.

    use warnings;
    use strict;
    use Digest::MD5 qw(md5);

    my %SEEN;
    while (<>) {
        chomp;
        # Key on the 16-byte binary digest rather than the full line.
        my $digest = md5($_);
        # True from the second occurrence onward; the count has already been incremented.
        if ($SEEN{$digest}++) {
            printf STDOUT "Dup: [%s] seen %d times\n", $_, $SEEN{$digest};
        }
    }
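The loop above reports a duplicate every time one is met. If one total per distinct line is wanted instead, a two-pass variant is one option. This is only a sketch: the names `count_by_digest` and `report_counts` are made up for illustration, the input is assumed to be a regular file that can be read twice (not a pipe), and two different lines with the same MD5 digest would share a count, which is extremely unlikely by accident.

```perl
use strict;
use warnings;
use Digest::MD5 qw(md5);

# Pass 1: count how often each line occurs, keyed by its 16-byte MD5 digest.
sub count_by_digest {
    my ($path) = @_;
    my %count;
    open my $fh, '<', $path or die "Cannot open $path: $!";
    while (my $line = <$fh>) {
        chomp $line;
        $count{ md5($line) }++;
    }
    close $fh;
    return \%count;
}

# Pass 2: print each distinct line once, with its total, in file order.
# delete() returns the stored count and removes the entry, so later
# repeats of the same line find nothing and are skipped.
sub report_counts {
    my ($path, $count, $out) = @_;
    open my $fh, '<', $path or die "Cannot open $path: $!";
    while (my $line = <$fh>) {
        chomp $line;
        my $n = delete $count->{ md5($line) };
        printf {$out} "%5d %s\n", $n, $line if defined $n;
    }
    close $fh;
}

# Run only when a file name is given on the command line.
if (@ARGV) {
    my $path = shift @ARGV;
    report_counts($path, count_by_digest($path), \*STDOUT);
}
```

Saved under a name of your choosing and run with the sequence file as its only argument, this prints a count followed by the line, much like `uniq -c`. Only the digests are held in memory; the lines themselves are re-read from disk in the second pass.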
       "60 characters per line, that's 480 bits"

      Why 60 × 8 = 480 bits, when one character from [ATGC] needs only 2 bits?

        Well, yes, a content-aware solution could mush that down to 2 bits per character. I was proposing, though, something more memory-efficient than $SEEN{$_}++.

        I don't know much about DNA but, googling around a bit, "one DNA sequence per line" could mean 237, 373, etc. characters per line. 373 × 2 = 746 bits, so a 128-bit MD5 digest could still be significantly smaller.

        Also, I don't know if OP's file format has comments or other things besides A/T/G/C.
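The arithmetic in this exchange can be checked with a short sketch. It assumes lines contain only A, C, G and T (the thread does not confirm this), and `pack_bases` is a hypothetical helper, not something from the posts above. It uses Perl's built-in `vec` to store each base in 2 bits and compares the result with the 16 bytes of a binary MD5 digest.

```perl
use strict;
use warnings;
use Digest::MD5 qw(md5);

# Assumed 2-bit encoding; any fixed mapping of the four bases would do.
my %CODE = (A => 0, C => 1, G => 2, T => 3);

# Hypothetical helper: pack a string of A/C/G/T into 2 bits per base.
sub pack_bases {
    my ($seq) = @_;
    my $packed = '';
    my $i = 0;
    for my $base (split //, uc $seq) {
        die "unexpected character '$base'\n" unless exists $CODE{$base};
        vec($packed, $i++, 2) = $CODE{$base};   # write the i-th 2-bit field
    }
    return $packed;
}

# Compare sizes for the line lengths mentioned in the thread.
for my $len (60, 237, 373) {
    my $seq = join '', map { (qw(A C G T))[$_ % 4] } 0 .. $len - 1;
    printf "%3d bases: %3d bytes raw, %2d bytes packed, %2d bytes as MD5\n",
        $len, length($seq), length(pack_bases($seq)), length(md5($seq));
}
```

For 60 bases the packed form is 15 bytes, slightly under the 16-byte digest; for 237 and 373 bases it is 60 and 94 bytes, so the digest is smaller there. One caveat if the packed string were used as a hash key: sequences of different lengths can pack to the same bytes (for example "ACG" and "ACGA" under this mapping), so the length would have to be stored alongside it.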

Re: Comparing each line of a file to itself
by LanX (Sage) on Jan 13, 2019 at 14:53 UTC
    I'd say you loop over the lines and count in a hash.
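A minimal sketch of that loop-and-count idea, as a standalone script. The function name `count_lines`, the decision to skip blank lines, and the sort order are all assumptions made for illustration, since no sample data was posted.

```perl
use strict;
use warnings;

# Count occurrences of each line read from the given filehandle.
sub count_lines {
    my ($fh) = @_;
    my %count;
    while (my $line = <$fh>) {
        chomp $line;
        next if $line eq '';    # assumption: blank lines are not sequences
        $count{$line}++;        # the line itself is the hash key
    }
    return \%count;
}

# Run only when a file name is given on the command line.
if (@ARGV) {
    my $path = shift @ARGV;
    open my $fh, '<', $path or die "Cannot open $path: $!";
    my $count = count_lines($fh);
    close $fh;

    # Most frequent first; ties broken alphabetically for stable output.
    for my $seq (sort { $count->{$b} <=> $count->{$a} || $a cmp $b } keys %$count) {
        printf "%5d %s\n", $count->{$seq}, $seq;
    }
}
```

Every distinct line is kept in memory as a hash key, which is fine for modest files and is the cost the digest-based reply in this thread tries to reduce.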

    Please show us some sample data and what you tried.

    Cheers Rolf
    (addicted to the Perl Programming Language :)
    Wikisyntax for the Monastery
    FootballPerl is like chess, only without the dice

Node Type: perlquestion [id://1228463]
Front-paged by Corion