How does one get only the non-redundant (non-repeating) entries with header?

supriyoch_2008 has asked for the wisdom of the Perl Monks concerning the following question:

Hi PerlMonks,

There are four samples (sample1 ... sample4), each with a different name. Samples 1 and 2 have the same sequence, ATGC; likewise, samples 3 and 4 share the sequence CCGG. I want to keep sample 1 and reject sample 2, since the latter has the same sequence as sample 1. The same applies to samples 3 and 4: I wish to keep sample 3 and reject sample 4. I am at my wit's end trying to fix this problem, and I look forward to suggestions from the Perl Monks.

I have written a script t2.pl (given below) to separate the header and the sequence. Here goes the script:

#!/usr/bin/perl 
use warnings; 
use strict;  

my $a=">sample1 ..sequence
ATGC fun
>sample2 ..sequence
ATGC fun
>sample3 ..sequence
CCGG fun
>sample4 ..sequence
CCGG fun"; 

 while ($a=~ />.*?fun/gs) {my $trial1=$&; my $trial2=$&;     

    while ($trial1=~ />.*sequence/gs) {my $header=$&;  
           $trial2=~ s/($header)//gs; 
           my $seq=$trial2; 
           $seq=~ s/\s//;
           $seq=~ s/fun//;   

    print "\n Header: $header
 Sequence: $seq\n"; 
    } 
 } 
# code?? 
exit;

The script gives me results like this:

 C:\Users\x\Desktop>t2.pl

 Header: >sample1 ..sequence
 Sequence: ATGC

 Header: >sample2 ..sequence
 Sequence: ATGC

 Header: >sample3 ..sequence
 Sequence: CCGG

 Header: >sample4 ..sequence
 Sequence: CCGG

But the expected results should look like:

>sample1 ..sequence
ATGC
>sample3 ..sequence
CCGG


Re: How does one get only the non-redundant (non-repeating) entries with header? by roboticus (Chancellor) on Jul 17, 2014 at 11:37 UTC
supriyoch_2008:

This is a commonly asked question, so you should review the perlfaqX documents. The typical solution is to store the sequences in a hash as you process them. Just before you process a sequence, check whether it is already in the hash. If so, skip to the next sequence. Otherwise, process the sequence and store it in the hash.

...roboticus

When your only tool is a hammer, all problems look like your thumb.
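A minimal sketch of the hash approach described above, assuming the data has the same shape as in the original post (a ">header" line followed by a "SEQ fun" line). The helper name unique_records and the %seen hash are made up for illustration and are not from the thread.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): returns one "header\nsequence\n"
# string per distinct sequence, keeping the first header seen for each.
sub unique_records {
    my ($text) = @_;
    my %seen;    # sequence => count of records seen with that sequence
    my @kept;

    # Assumed record layout: a ">header" line, then a "SEQ fun" line.
    while ( $text =~ /^(>[^\n]*)\n(\S+)[ ]+fun$/mg ) {
        my ( $header, $seq ) = ( $1, $2 );
        next if $seen{$seq}++;    # sequence already stored: skip this record
        push @kept, "$header\n$seq\n";
    }
    return @kept;
}

my $data = <<'END';
>sample1 ..sequence
ATGC fun
>sample2 ..sequence
ATGC fun
>sample3 ..sequence
CCGG fun
>sample4 ..sequence
CCGG fun
END

print unique_records($data);
```

With the four sample records this should print only the sample1 and sample3 entries. The check and the store are combined in `$seen{$seq}++`, which is false the first time a sequence appears and true on every later occurrence.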
Re^2: How does one get only the non-redundant (non-repeating) entries with header? by supriyoch_2008 (Monk) on Jul 19, 2014 at 09:14 UTC
Hi roboticus,

Thank you for your suggestions. I shall try to fix the problem.
Re: How does one get only the non-redundant (non-repeating) entries with header? by ww (Archbishop) on Jul 17, 2014 at 12:45 UTC
Please learn to use Super Search for questions like this.

FAQ-tully (or tutorial-ly), use a hash and the input record separator. Doing that without checking the hash content (that is, simply stuffing the latest match into the hash) will give you the LAST header seen for each distinct sequence. If you care about getting the first "`sample\d`", use the technique outlined by roboticus.

check Ln42!
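A rough sketch of the input record separator idea, assuming ">" never appears inside a header or sequence and that Perl 5.10 or later is available (for `//=`). The helper name read_records and the in-memory filehandle are illustrative choices, not something from the thread. It builds both variants so the difference between "last wins" and "first wins" is visible.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: reads ">"-separated records from a filehandle and
# returns two hash refs keyed by sequence: first header seen, last header seen.
sub read_records {
    my ($fh) = @_;
    local $/ = '>';    # input record separator: each read ends at the next ">"
    my ( %first, %last );
    while ( my $record = <$fh> ) {
        chomp $record;                  # removes a trailing ">" because $/ is ">"
        next unless $record =~ /\S/;    # the chunk before the first ">" is empty
        my ( $header, $seq ) = $record =~ /^([^\n]*)\n(\S+)/ or next;
        $last{$seq} = $header;          # unconditional store: last header wins
        $first{$seq} //= $header;       # store only if absent: first header wins
    }
    return ( \%first, \%last );
}

my $data = <<'END';
>sample1 ..sequence
ATGC fun
>sample2 ..sequence
ATGC fun
>sample3 ..sequence
CCGG fun
>sample4 ..sequence
CCGG fun
END

open my $fh, '<', \$data or die "Cannot open in-memory handle: $!";
my ( $first, $last ) = read_records($fh);
close $fh;

# Keys are sorted only to make the output stable; a hash does not keep input order.
print ">$first->{$_}\n$_\n" for sort keys %$first;
```

Printing from `%$last` instead would show the sample2 and sample4 headers, which is the "last wins" behaviour mentioned above.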
Re^2: How does one get only the non-redundant (non-repeating) entries with header? by supriyoch_2008 (Monk) on Jul 19, 2014 at 09:18 UTC
Hi ww,

Thank you for your suggestions. I shall read the material and try to solve the problem. I am sorry for the delayed reply; my internet connectivity is poor.

