You probably know that redundancy among DNA sequences does not require 100% conservation. Sequences that are not 100% identical may still be considered redundant, since the few differences between them may have no functional impact. For example, if the sequences code for a protein, mutations can still preserve the encoded protein because of the degeneracy of the genetic code. That means you are probably filtering out only the 100% identical sequences, while there could be, biologically speaking, other redundant sequences you did not look at.

My answer is similar to what choroba and Athanasius have suggested, with a slight modification: I list all the IDs that share an identical sequence, and to make my life a bit easier I use BioPerl. You can then include just one ID in your analysis to represent each cluster of identical sequences.

use strict;
use warnings;
use Bio::SeqIO;

my %hash;    # updated: sequence string => array ref of the IDs sharing it

# Read the sequences from a file in FASTA format
my $in = Bio::SeqIO->new(
    -file   => "sequences.fa",
    -format => "fasta",
);

# Collect the IDs of identical sequences under the same hash key
while ( my $seq = $in->next_seq() ) {
    push @{$hash{$seq->seq}}, $seq->id;
}

# Print each group of identical IDs on its own line, preceded by the group size
foreach my $key ( keys %hash ) {
    print scalar @{$hash{$key}}, "\t";
    print "@{$hash{$key}}", "\n";
}
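As a rough sketch of the last step (keeping one ID per cluster), something like the following could work. It is plain Perl with no BioPerl, and it makes some assumptions: %groups and the IDs in it are made-up stand-ins with the same shape as the %hash built above, and pick_representatives is just a name chosen here for illustration.

```perl
use strict;
use warnings;

# Hypothetical data in the same shape as %hash above:
# sequence string => array ref of the IDs that share it.
my %groups = (
    'ATGGCC' => [ 'seq1', 'seq4' ],
    'ATGGCA' => [ 'seq2' ],
    'TTGACA' => [ 'seq3', 'seq5', 'seq6' ],
);

# Return one representative ID per group. The first element of each
# array is the first ID that was pushed, i.e. the first occurrence
# in the input file. The result is sorted to give a stable order.
sub pick_representatives {
    my ($groups) = @_;
    my @reps   = map { $_->[0] } values %$groups;
    my @sorted = sort @reps;
    return @sorted;
}

print "$_\n" for pick_representatives( \%groups );
```

Taking the first ID of each group is an arbitrary choice; any member would do, since the sequences within a group are identical.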

If your dataset is really huge, you may want to cluster on sequence similarity rather than sequence identity, since you won't lose much of the biological signal if you define a sensible similarity threshold to cluster around. Routinely used tools worth exploring for that purpose include cd-hit-est and uclust.

UPDATE 09/09/2015: Predeclared the hash in response to the suggestion from Not_a_Number. I wrote the code off the top of my head without testing it, so I missed the variable declaration.

David R. Gergen said "We know that second terms have historically been marred by hubris and by scandal." and I am a two y.o. monk today :D, June,12th, 2011...

In reply to Re: How to get non-redundant DNA sequences from a FASTA file? by biohisham
in thread How to get non-redundant DNA sequences from a FASTA file? by supriyoch_2008
