in reply to How to get non-redundant DNA sequences from a FASTA file?
You perhaps know that redundancy of DNA sequences doesn't totally imply a 100% conservation of the sequences. Even sequences that are not 100% identical maybe considered redundant and the few differences between them may not result in any functional impact. So if sequences code for a protein, the selection is such that mutations may still preserve the encoded protein due to the degenracy of the genetic code. That means, you're probably filtering 100% identical sequences but there could be, biologically speaking, other redundant sequences you did not look at.
My answer will be similar to what choroba and Athanasius have suggested, but with a slight modification. The modification is, I list every ID for which sequences are identical and to ease my life a bit I am using BioPerl. Then you can easily just include into your analysis one ID to represent that cluster of sequences
If your dataset is really really huge then you may want to think of clustering based on sequence-similarity as opposed to sequence-identity since you won't lose so much of the biological signals if you define a sensible similarity threshold to cluster around. There are routinely used tools that you can explore towards that purpose like cd-hits-est and uclust for example. UPDATE: 09/09/2015: Predeclared the hash in response to the suggestion provided by Not_a_Number. Since I wrote the code out of my head without testing it I missed the variable declaration.use strict; use warnings; use Data::Dumper; use Bio::SeqIO; my %hash; #updated. #Reading sequence files in Fasta format my $in=Bio::SeqIO->new( -file=> "sequences.fa", -format=>"fasta", ); #getting the IDs of the identical sequences into a data structure while(my $seq=$in->next_seq()){ #print $seq->id,$/; push @{$hash{$seq->seq}}, $seq->id; } #print each group of identical IDs into a separate line foreach my $key(keys %hash){ if(scalar @{$hash{$key}}>=1){ print scalar @{$hash{$key}},"\t"; print "@{$hash{$key}}","\n"; } }
David R. Gergen said "We know that second terms have historically been marred by hubris and by scandal." and I am a two y.o. monk today :D, June,12th, 2011...