Beefy Boxes and Bandwidth Generously Provided by pair Networks
Your skill will accomplish
what the force of many cannot
 
PerlMonks  

How to get non-redundant DNA sequences from a FASTA file?

by supriyoch_2008 (Monk)
on Sep 13, 2014 at 04:43 UTC ( [id://1100457]=perlquestion: print w/replies, xml ) Need Help??

supriyoch_2008 has asked for the wisdom of the Perl Monks concerning the following question:

Hi Perlmonks

I am interested in getting the non-redundant DNA sequences from a FASTA file. Two sequences may have different headers but have the same DNA sequence. I want any one of these two sequences with its header in the output, not both. I have written a script (try.pl) but it does not give the results I want. I seek help from perl monks to fix this problem.

Here goes the script try.pl:

#!/usr/bin/perl use warnings; $a=">gi1 cds ATG fun >gi2 cds ATG fun >gi3 cds GGG fun"; while ($a=~ m/(>.*?fun)/gs) { $b1=$&; $b2=$&; while ($b1=~ />.*?cds/gs) { $h=$&; $b2=~ s/$h//g; $b2=~ s/fun//g; $seq=$b2; $seq=~ s/\n//; $hdr_seq="$h\n"."$seq\n"; push @hdr_seq1,$hdr_seq; push @only_seq1,$seq; } } # To remove multiple copies (if any): my %seen; # declare a hash my @only_seq=(); my @hdr_seq=(); @only_seq=grep{!$seen{$_}++}@only_seq1; @hdr_seq=grep{!$seen{$_}++}@hdr_seq1; print "\n\n A. Header & sequences are:\n\n"; print join ("\n", @hdr_seq); print "\n"; print "\n B. Only sequences are:\n\n"; print join ("\n", @only_seq); print "\n\n"; $num=0; foreach my $item1 (@only_seq) {$num++; # No.1 curly $seq1=$item1; foreach my $item2 (@hdr_seq) { # No.2 curly if (defined $item2) { $item2=$item2; $item3=$item2;} while ($item2=~ m/>.*cds/gs) { $hdr2=$&; $item3=~ s/$hdr2//; $item3=~ s/\s//; $seq2=$item3; $ele2="$hdr2\n"."$seq2\n"; if ($seq1 eq $seq2) {push @result1,$ele2;} else {push @result1,$ele2;} } } # No.2 curly } # No.1 curly ###################################### my @result=(); @result=grep{!$seen{$_}++}@result1; print "\n C. Non-redundant sequences are:\n\n"; print join ("",@result); print "\n"; exit;
##################

The results of the script go like:

Microsoft Windows [Version 6.1.7600] Copyright (c) 2009 Microsoft Corporation. All rights reserved. C:\Users\x\Desktop>try.pl A. Header & sequences are: >gi1 cds ATG >gi2 cds ATG >gi3 cds GGG B. Only sequences are: ATG GGG C. Non-redundant sequences are: (This is wrong) >gi1 cds ATG >gi2 cds ATG >gi3 cds GGG

Correct results for Non-redundant sequences should be like:

>gi1 cds ATG >gi3 cds GGG

Replies are listed 'Best First'.
Re: How to get non-redundant DNA sequences from a FASTA file?
by choroba (Cardinal) on Sep 13, 2014 at 08:49 UTC
    When you hear "unique", think "hash". In this case, you need to hash headers by sequences:
    #!/usr/bin/perl use warnings; use strict; my $fasta = << '__FASTA__'; >gi1 cds ATG fun >gi2 cds ATG fun >gi3 cds GGG fun __FASTA__ my @seq_with_hdr = split /\n>/, $fasta; $seq_with_hdr[0] =~ s/^>//; my %hdr_by_seq; for (@seq_with_hdr) { my ($hdr, $seq) = split /\n/; $hdr_by_seq{$seq} = $hdr; } for my $seq (keys %hdr_by_seq) { print ">$hdr_by_seq{$seq}\n$seq\n" }

    Note that whitespace is not ignored in the data. There was a space after one of "ATG FUN" sequences which makes it different to the same sequence without the trailing space. I removed the space in my code.

    لսႽ† ᥲᥒ⚪⟊Ⴙᘓᖇ Ꮅᘓᖇ⎱ Ⴙᥲ𝇋ƙᘓᖇ

      Hi Choroba,

      Thank you very much for fixing the problem and providing me valuable suggestions regarding unique (array) and whitespace. I shall follow your suggestions. I searched in google for fixing this problem using perl code but I didn't get any such information. But I found a script based on Java program which I do not know. The URL for java solution is http://seqanswers.com/forums/showthread.php?t=4442

      So, I wrote to perl monks for help.

      With regards

Re: How to get non-redundant DNA sequences from a FASTA file?
by Athanasius (Archbishop) on Sep 13, 2014 at 09:08 UTC

    Hello supriyoch_2008,

    Same approach as choroba, except that I read the data into a hash mapping headers to sequences, and then reverse that hash to get the non-redundant sequences:

    #! perl use strict; use warnings; my $fasta = '>gi1 cds ATG fun >gi2 cds ATG fun >gi3 cds GGG fun'; my %hdrs; $hdrs{$1} = $2 while $fasta =~ / > (.+) \s+ cds \s+ (.*) \s+ fun /g +x; print " A. Header & sequences are:\n"; printf ">%s cds\n%s\n", $_, $hdrs{$_} for sort keys %hdrs; my %seqs; while (my ($key, $value) = each %hdrs) { push @{$seqs{$value}}, $key; } print " B. Only sequences are:\n"; printf "$_\n" for sort keys %seqs; print " C. Non-redundant sequences are:\n"; printf ">%s cds\n%s\n", ( sort @{$seqs{$_}} )[0], $_ for sort keys %se +qs;

    Output:

    18:47 >perl 1009_SoPW.pl A. Header & sequences are: >gi1 cds ATG >gi2 cds ATG >gi3 cds GGG B. Only sequences are: ATG GGG C. Non-redundant sequences are: >gi1 cds ATG >gi3 cds GGG 18:47 >

    Note: on hash reversal, see How-do-I-look-up-a-hash-element-by-value of perlfaq4.

    If you really don’t care which header is output when there are redundant sequences, you can just say:

    my %seqs = reverse %hdrs; ... print " C. Non-redundant sequences are:\n"; printf ">%s cds\n%s\n", $seqs{$_}, $_ for sort keys %seqs;

    When your code starts to get too complicated, it’s usually a good idea to step back and look for a simpler approach. Remember, less is more. ;-)

    Hope that helps,

    Athanasius <°(((><contra mundum Iustus alius egestas vitae, eros Piratica,

      Hi Athanasius,

      Thank you for your suggestions and fixing the problem. Your script has worked nicely. I am grateful to you.

      With regards,

Re: How to get non-redundant DNA sequences from a FASTA file?
by biohisham (Priest) on Sep 15, 2014 at 05:48 UTC

    You perhaps know that redundancy of DNA sequences doesn't totally imply a 100% conservation of the sequences. Even sequences that are not 100% identical maybe considered redundant and the few differences between them may not result in any functional impact. So if sequences code for a protein, the selection is such that mutations may still preserve the encoded protein due to the degenracy of the genetic code. That means, you're probably filtering 100% identical sequences but there could be, biologically speaking, other redundant sequences you did not look at.

    My answer will be similar to what choroba and Athanasius have suggested, but with a slight modification. The modification is, I list every ID for which sequences are identical and to ease my life a bit I am using BioPerl. Then you can easily just include into your analysis one ID to represent that cluster of sequences

    use strict; use warnings; use Data::Dumper; use Bio::SeqIO; my %hash; #updated. #Reading sequence files in Fasta format my $in=Bio::SeqIO->new( -file=> "sequences.fa", -format=>"fasta", ); #getting the IDs of the identical sequences into a data structure while(my $seq=$in->next_seq()){ #print $seq->id,$/; push @{$hash{$seq->seq}}, $seq->id; } #print each group of identical IDs into a separate line foreach my $key(keys %hash){ if(scalar @{$hash{$key}}>=1){ print scalar @{$hash{$key}},"\t"; print "@{$hash{$key}}","\n"; } }
    If your dataset is really really huge then you may want to think of clustering based on sequence-similarity as opposed to sequence-identity since you won't lose so much of the biological signals if you define a sensible similarity threshold to cluster around. There are routinely used tools that you can explore towards that purpose like cd-hits-est and uclust for example.

    UPDATE: 09/09/2015: Predeclared the hash in response to the suggestion provided by Not_a_Number. Since I wrote the code out of my head without testing it I missed the variable declaration.


    David R. Gergen said "We know that second terms have historically been marred by hubris and by scandal." and I am a two y.o. monk today :D, June,12th, 2011...
      Hi biohisham,

      I am trying to use your solution for redundant fasta file but Ive got this message:Global symbol "%hash" requires explicit package name at redundant.fasta line 14.

      Global symbol "%hash" requires explicit package name at redundant.fast +a line 18. Global symbol "%hash" requires explicit package name at redundant.fast +a line 19. Global symbol "%hash" requires explicit package name at redundant.fast +a line 20. Global symbol "%hash" requires explicit package name at redundant.fast +a line 21.
      I installed BioPerl: cpan>install CJFIELDS/BioPerl-1.6.924.tar.gz Data-Dumper: cpan>SMUELLER/Data-Dumper-2.154.tar.gz and also EDWARDSON/Data-Dumper-Hash-0.001.tar.gz

      I really appreciate your answer about how to solve it. Btw, I am newbie. Thanks Eric

        You (and biohisham) need to declare %hash before using it. Simply put this line:

        my %hash;

        before the first while loop.

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: perlquestion [id://1100457]
Approved by Athanasius
Front-paged by toolic
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others making s'mores by the fire in the courtyard of the Monastery: (3)
As of 2024-04-26 00:51 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found