Beefy Boxes and Bandwidth Generously Provided by pair Networks
P is for Practical

Re: How to get non-redundant DNA sequences from a FASTA file?

by biohisham (Priest)
on Sep 15, 2014 at 05:48 UTC ( #1100556=note: print w/replies, xml ) Need Help??

in reply to How to get non-redundant DNA sequences from a FASTA file?

You perhaps know that redundancy of DNA sequences doesn't totally imply a 100% conservation of the sequences. Even sequences that are not 100% identical maybe considered redundant and the few differences between them may not result in any functional impact. So if sequences code for a protein, the selection is such that mutations may still preserve the encoded protein due to the degenracy of the genetic code. That means, you're probably filtering 100% identical sequences but there could be, biologically speaking, other redundant sequences you did not look at.

My answer will be similar to what choroba and Athanasius have suggested, but with a slight modification. The modification is, I list every ID for which sequences are identical and to ease my life a bit I am using BioPerl. Then you can easily just include into your analysis one ID to represent that cluster of sequences

use strict; use warnings; use Data::Dumper; use Bio::SeqIO; my %hash; #updated. #Reading sequence files in Fasta format my $in=Bio::SeqIO->new( -file=> "sequences.fa", -format=>"fasta", ); #getting the IDs of the identical sequences into a data structure while(my $seq=$in->next_seq()){ #print $seq->id,$/; push @{$hash{$seq->seq}}, $seq->id; } #print each group of identical IDs into a separate line foreach my $key(keys %hash){ if(scalar @{$hash{$key}}>=1){ print scalar @{$hash{$key}},"\t"; print "@{$hash{$key}}","\n"; } }
If your dataset is really really huge then you may want to think of clustering based on sequence-similarity as opposed to sequence-identity since you won't lose so much of the biological signals if you define a sensible similarity threshold to cluster around. There are routinely used tools that you can explore towards that purpose like cd-hits-est and uclust for example.

UPDATE: 09/09/2015: Predeclared the hash in response to the suggestion provided by Not_a_Number. Since I wrote the code out of my head without testing it I missed the variable declaration.

David R. Gergen said "We know that second terms have historically been marred by hubris and by scandal." and I am a two y.o. monk today :D, June,12th, 2011...

Replies are listed 'Best First'.
Re^2: How to get non-redundant DNA sequences from a FASTA file?
by Anonymous Monk on Sep 06, 2015 at 12:14 UTC
    Hi biohisham,

    I am trying to use your solution for redundant fasta file but Ive got this message:Global symbol "%hash" requires explicit package name at redundant.fasta line 14.

    Global symbol "%hash" requires explicit package name at +a line 18. Global symbol "%hash" requires explicit package name at +a line 19. Global symbol "%hash" requires explicit package name at +a line 20. Global symbol "%hash" requires explicit package name at +a line 21.
    I installed BioPerl: cpan>install CJFIELDS/BioPerl-1.6.924.tar.gz Data-Dumper: cpan>SMUELLER/Data-Dumper-2.154.tar.gz and also EDWARDSON/Data-Dumper-Hash-0.001.tar.gz

    I really appreciate your answer about how to solve it. Btw, I am newbie. Thanks Eric

      You (and biohisham) need to declare %hash before using it. Simply put this line:

      my %hash;

      before the first while loop.

        Hi Not_a_Number, Many thanks for your suggestion. Now the script is working nicely. Regards.

Log In?

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: note [id://1100556]
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others musing on the Monastery: (4)
As of 2021-10-22 21:29 GMT
Find Nodes?
    Voting Booth?
    My first memorable Perl project was:

    Results (85 votes). Check out past polls.