You may already know that redundancy among DNA sequences does not require 100% sequence conservation. Sequences that are not 100% identical may still be considered redundant, because the few differences between them may have no functional impact. For example, if the sequences code for a protein, mutations may still preserve the encoded protein thanks to the degeneracy of the genetic code. That means you are probably filtering out only the 100% identical sequences, while there could be, biologically speaking, other redundant sequences that you did not look at.
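To make the degeneracy point concrete, here is a minimal sketch in plain Perl. The two sequences and the helper name translate_dna are made up for illustration, and the sketch assumes the standard genetic code; in practice you would more likely let BioPerl do the translation.

```perl
use strict;
use warnings;

# Standard genetic code, with codons enumerated in T, C, A, G order.
my @bases = qw(T C A G);
my $amino = 'FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG';
my %codon_table;
my $i = 0;
for my $first (@bases) {
    for my $second (@bases) {
        for my $third (@bases) {
            $codon_table{"$first$second$third"} = substr($amino, $i++, 1);
        }
    }
}

# Hypothetical helper: translate a DNA string codon by codon.
# Codons that are not in the table (e.g. containing N) become 'X'.
sub translate_dna {
    my ($dna) = @_;
    my $protein = '';
    for (my $pos = 0; $pos + 3 <= length $dna; $pos += 3) {
        my $codon = uc substr($dna, $pos, 3);
        $protein .= $codon_table{$codon} // 'X';
    }
    return $protein;
}

# Two made-up sequences that differ at three nucleotide positions.
my $seq_a = 'ATGGCTCTGTAA';
my $seq_b = 'ATGGCCTTATAA';

printf "%s -> %s\n", $seq_a, translate_dna($seq_a);
printf "%s -> %s\n", $seq_b, translate_dna($seq_b);
print "DNA identical:     ", ($seq_a eq $seq_b ? "yes" : "no"), "\n";
print "Protein identical: ",
    (translate_dna($seq_a) eq translate_dna($seq_b) ? "yes" : "no"), "\n";
```

A filter based on exact string identity would keep both of these sequences, even though they encode the same peptide.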
My answer is similar to what choroba and Athanasius have suggested, with a slight modification: I list the IDs of every group of identical sequences and, to make life a bit easier, I use BioPerl. You can then simply include one ID in your analysis to represent each cluster of identical sequences.
use strict;
use warnings;
use Bio::SeqIO;

my %hash;    # updated: sequence string => list of IDs sharing it

# Read the sequence file in FASTA format
my $in = Bio::SeqIO->new(
    -file   => "sequences.fa",
    -format => "fasta",
);

# Collect the IDs of identical sequences into a data structure
while ( my $seq = $in->next_seq() ) {
    push @{ $hash{ $seq->seq } }, $seq->id;
}

# Print each group of identical IDs on a separate line,
# preceded by the size of the group
foreach my $key ( keys %hash ) {
    # >= 1 lists every group, including sequences that occur only once;
    # change it to > 1 to list only the duplicated sequences
    if ( scalar @{ $hash{$key} } >= 1 ) {
        print scalar @{ $hash{$key} }, "\t";
        print "@{$hash{$key}}", "\n";
    }
}
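Picking one ID to stand for each group could look roughly like the sketch below. It uses a small made-up %hash in place of the one built from sequences.fa, and taking the first ID seen as the representative is just an assumption; any member of the group would do.

```perl
use strict;
use warnings;

# Made-up stand-in for the %hash built above: sequence => [IDs].
my %hash = (
    'ACGTACGT' => [ 'seq1', 'seq4', 'seq7' ],
    'TTGACCA'  => [ 'seq2' ],
    'GGGCATT'  => [ 'seq3', 'seq5' ],
);

# Keep the first ID seen for each sequence as the representative
# and remember which other IDs it stands for.
my %members_of;
for my $sequence ( keys %hash ) {
    my ( $representative, @others ) = @{ $hash{$sequence} };
    $members_of{$representative} = \@others;
}

for my $representative ( sort keys %members_of ) {
    my @others = @{ $members_of{$representative} };
    print "$representative\t",
        ( @others ? "also stands for: @others" : "unique" ), "\n";
}
```

Only the IDs in %members_of (the name is hypothetical) would then go into the downstream analysis.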
If your dataset is really huge, you may want to cluster by sequence similarity rather than sequence identity, since you won't lose much of the biological signal if you define a sensible similarity threshold to cluster around. Routinely used tools that you can explore for that purpose include cd-hit-est and UCLUST.
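To illustrate what clustering around a similarity threshold means, here is a toy sketch in plain Perl. It is not how the tools above work internally: it assumes equal-length sequences, compares them position by position, and greedily assigns each sequence to the first representative that is similar enough. The sub names, the records and the 0.9 threshold are all made up for the example.

```perl
use strict;
use warnings;

# Fraction of matching positions between two equal-length strings.
# Strings of different length are treated as 0% identical (toy assumption).
sub identity {
    my ( $x, $y ) = @_;
    return 0 if length($x) != length($y) || length($x) == 0;
    my $matches = 0;
    for my $pos ( 0 .. length($x) - 1 ) {
        $matches++ if substr( $x, $pos, 1 ) eq substr( $y, $pos, 1 );
    }
    return $matches / length($x);
}

# Greedy clustering: each sequence joins the first representative it
# matches at or above the threshold, otherwise it starts a new cluster.
sub greedy_cluster {
    my ( $threshold, @records ) = @_;    # each record is [id, sequence]
    my @clusters;                        # each cluster: { rep => sequence, ids => [...] }
  RECORD:
    for my $record (@records) {
        my ( $id, $sequence ) = @$record;
        for my $cluster (@clusters) {
            if ( identity( $cluster->{rep}, $sequence ) >= $threshold ) {
                push @{ $cluster->{ids} }, $id;
                next RECORD;
            }
        }
        push @clusters, { rep => $sequence, ids => [$id] };
    }
    return @clusters;
}

# Made-up records for illustration.
my @records = (
    [ seq1 => 'ACGTACGTAC' ],
    [ seq2 => 'ACGTACGTAT' ],    # one mismatch against seq1
    [ seq3 => 'TTTTGGGGCC' ],    # unrelated
    [ seq4 => 'ACGTACGTAC' ],    # identical to seq1
);

# Print cluster size and member IDs, one cluster per line.
for my $cluster ( greedy_cluster( 0.9, @records ) ) {
    print scalar @{ $cluster->{ids} }, "\t@{ $cluster->{ids} }\n";
}
```

With an identity-only filter seq2 would be kept as a separate sequence; at a 0.9 threshold it falls into the same cluster as seq1 and seq4. For real data, a dedicated tool is the better choice, since it handles sequences of different lengths and scales to large inputs.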
UPDATE: 09/09/2015: Predeclared the hash in response to the suggestion provided by Not_a_Number. Since I wrote the code off the top of my head without testing it, I missed the variable declaration.
David R. Gergen said "We know that second terms have historically been marred by hubris and by scandal." and I am a two y.o. monk today :D, June,12th, 2011...