PerlMonks  

Re^8: how to read input from a file, one section at a time?

by davi54 (Sexton)
on Apr 01, 2019 at 21:59 UTC (#1231963)


in reply to Re^7: how to read input from a file, one section at a time?
in thread how to read input from a file, one section at a time?

Hi Poj,

Thanks again for your prompt help; I really appreciate it. The script works perfectly, although I have a small issue: my input file has multiple duplicate entries. Is there any way to remove the duplicates from the file before starting the actual analysis that this script does? I was wondering whether the fasta headers could be compared to check for duplicate entries before removing them. This could be a separate script (run before this one) or part of this script.

Again, thank you so much for your help and time.

Re^9: how to read input from a file, one section at a time?
by AnomalousMonk (Bishop) on Apr 01, 2019 at 23:31 UTC

    From poj's code:

        my $name;
        while ( my $para = <$PROTFILE> ) {
            # Remove fasta header line
            if ( $para =~ s/^>(.*)//m ){
                $name = $1;
            };
            ...
        }

    A quick and dirty and UNTESTED modification to do what I think you want:

        my $name;
        my %name_seen;    # fasta headers seen so far
        FASTA_RECORD:
        while ( my $para = <$PROTFILE> ) {
            # Remove fasta header line
            if ( $para =~ s/^>(.*)//m ){
                $name = $1;
                next FASTA_RECORD if $name_seen{ $name }++;
            };
            ...
        }

    Warning: The requirement to "... get rid of duplicate entries ..." is ambiguous. If there is more than one entry with the same header (i.e., $name), which is (or are, if there are more than two) the duplicate(s)? The first one? The last one? Etc. The code modification above ignores all entries with a given $name after the first one. Also, it might be wise to trim all leading/trailing whitespace from $name before any further processing whatsoever (also untested):
        $name = $1;
        $name =~ s{ \A \s+ | \s+ \z }{}xmsg;
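To see the two pieces working together, here is a minimal, self-contained sketch that skips records whose trimmed header has already been seen. The sample data, the in-memory filehandle, the `@kept` array and the use of paragraph mode (`$/ = ''`) are all assumptions made for illustration; the thread does not show how `$PROTFILE` is actually opened or how records are separated.

```perl
use strict;
use warnings;

# Hypothetical sample data. Records are assumed to be separated by blank
# lines so that paragraph mode returns one fasta record per read.
my $data = <<'END';
>sp|P1
MKV
LLA

>sp|P2
MKV

>  sp|P1
GGG
END

open my $PROTFILE, '<', \$data or die "Cannot open in-memory file: $!";

my @kept;         # headers of the records that were not skipped
my %name_seen;    # fasta headers seen so far

{
    local $/ = '';    # paragraph mode (an assumption, see above)

    FASTA_RECORD:
    while ( my $para = <$PROTFILE> ) {
        # Remove fasta header line; skip anything without a header
        next FASTA_RECORD unless $para =~ s/^>(.*)//m;
        my $name = $1;

        # Trim leading/trailing whitespace before comparing headers
        $name =~ s{ \A \s+ | \s+ \z }{}xmsg;

        # Ignore every record after the first with the same header
        next FASTA_RECORD if $name_seen{ $name }++;

        push @kept, $name;
    }
}

print "$_\n" for @kept;
```

With this sample, the third record is skipped because its header matches the first once the leading spaces are trimmed.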


    Give a man a fish:  <%-{-{-{-<

      Hi,

      My apologies for not being clear. Multiple proteins can have different headers but identical sequence information. When I say duplicate entries, I mean the actual sequence (and not the header). I want the script to read the input file, identify whether more than one entry has the same sequence information, and print those entries. Does that help? Again, sorry for the confusion and thank you for your help.

        Try

            my %fasta_seen;
            FASTA_RECORD:
            while ( my $para = <$PROTFILE> ) {
                # Remove fasta header line
                if ( $para =~ s/^>(.*)//m ){
                    $name = $1;
                };
                # Remove comment line(s)
                $para =~ s/^\s*#.*//mg;
                # Skip any record whose sequence text has been seen before
                next FASTA_RECORD if $fasta_seen{ $para }++;
                ...
            }

        This may not be a sensible solution if your sequences are very long, in which case consider using a message digest such as Digest::MD5.

        poj
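A rough sketch of the digest idea mentioned above, which also reports the duplicates rather than silently skipping them, as the question asked. Several things here are assumptions and not part of the thread: the sample data, paragraph mode for reading records, the names `%first_name` and `@duplicates`, and the step that strips whitespace so that differently wrapped copies of a sequence compare equal. `md5_hex` is the function exported by the core Digest::MD5 module.

```perl
use strict;
use warnings;
use Digest::MD5 qw(md5_hex);

# Hypothetical sample data: protA and protC carry the same sequence,
# wrapped differently, under different headers.
my $data = <<'END';
>protA
MKVLLA
GGS

>protB
TTTT

>protC
MKVLLAGGS
END

open my $PROTFILE, '<', \$data or die "Cannot open in-memory file: $!";

my %first_name;    # digest => header of the first record with that sequence
my @duplicates;    # pairs of [ duplicate header, header of first occurrence ]

{
    local $/ = '';    # paragraph mode (an assumption about the input layout)

    FASTA_RECORD:
    while ( my $para = <$PROTFILE> ) {
        # Remove fasta header line; skip anything without a header
        next FASTA_RECORD unless $para =~ s/^>(.*)//m;
        my $name = $1;

        # Remove comment line(s), as in the snippet above
        $para =~ s/^\s*#.*//mg;

        # Assumption: line wrapping is not significant, so drop all whitespace
        $para =~ s/\s+//g;

        # Store a 32-character digest instead of the full sequence
        my $digest = md5_hex($para);

        if ( exists $first_name{$digest} ) {
            push @duplicates, [ $name, $first_name{$digest} ];
            next FASTA_RECORD;
        }
        $first_name{$digest} = $name;
    }
}

printf "%s duplicates %s\n", @$_ for @duplicates;
```

The hash then holds one short digest per distinct sequence instead of the sequence itself. Two different sequences producing the same MD5 by accident is extremely unlikely, but if that matters the full sequences of a reported pair can be compared directly before discarding anything.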
