PerlMonks  

How does one get only the non-redundant (non-repeating) entries with header?

by supriyoch_2008 (Monk)
on Jul 17, 2014 at 11:28 UTC ( [id://1094017]=perlquestion )

supriyoch_2008 has asked for the wisdom of the Perl Monks concerning the following question:

Hi PerlMonks,

There are four samples (sample1...sample4), each with a different name. Samples 1 and 2 have the same sequence, ATGC. Likewise, samples 3 and 4 have the same sequence, CCGG. I want to retain only sample 1 with sequence ATGC and reject sample 2, since it shares its sequence with sample 1. The same applies to samples 3 and 4: I wish to retain sample 3 and reject sample 4. I am at my wit's end trying to fix this problem, and I look forward to suggestions from the Perl Monks.

I have written a script t2.pl (given below) to separate the header and the sequence. Here goes the script:

#!/usr/bin/perl
use warnings;
use strict;

my $a=">sample1 ..sequence ATGC fun >sample2 ..sequence ATGC fun >sample3 ..sequence CCGG fun >sample4 ..sequence CCGG fun";

while ($a=~ />.*?fun/gs) {
    my $trial1=$&;
    my $trial2=$&;
    while ($trial1=~ />.*sequence/gs) {
        my $header=$&;
        $trial2=~ s/($header)//gs;
        my $seq=$trial2;
        $seq=~ s/\s//;
        $seq=~ s/fun//;
        print "\n Header: $header Sequence: $seq\n";
    }
}
# code??
exit;

I got results like:

C:\Users\x\Desktop>t2.pl
 Header: >sample1 ..sequence Sequence: ATGC
 Header: >sample2 ..sequence Sequence: ATGC
 Header: >sample3 ..sequence Sequence: CCGG
 Header: >sample4 ..sequence Sequence: CCGG

But the expected results should look like:

>sample1 ..sequence ATGC
>sample3 ..sequence CCGG

Replies are listed 'Best First'.
Re: How does one get only the non-redundant (non-repeating) entries with header?
by roboticus (Chancellor) on Jul 17, 2014 at 11:37 UTC

    supriyoch_2008:

    This is a commonly-asked question, so you should review the perlfaqX documents.

    The typical solution is to store the sequences in a hash as you process them. So just before you process the sequence, check to see if it's in the hash. If so, skip to the next sequence. Then process the sequence and store it in the hash.

    ...roboticus

    When your only tool is a hammer, all problems look like your thumb.
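
The hash approach roboticus describes can be sketched roughly as follows. This is only an illustration, not roboticus's own code: the helper name first_unique_records and the capturing regex are assumptions made for the example, and the data layout (each record starting with '>' and ending with the word "fun") is taken from the question.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Sample data in the shape used in the question.
my $data = ">sample1 ..sequence ATGC fun >sample2 ..sequence ATGC fun "
         . ">sample3 ..sequence CCGG fun >sample4 ..sequence CCGG fun";

# Hypothetical helper: return "header sequence" strings, keeping only the
# first record seen for each distinct sequence.
sub first_unique_records {
    my ($text) = @_;
    my (%seen, @kept);

    # $1 is the header (everything up to the word "sequence"),
    # $2 is the sequence itself.
    while ($text =~ /(>.*?sequence)\s+(\S+)\s+fun/gs) {
        my ($header, $seq) = ($1, $2);
        next if $seen{$seq}++;    # sequence already in the hash: skip it
        push @kept, "$header $seq";
    }
    return @kept;
}

print "$_\n" for first_unique_records($data);
```

The `next if $seen{$seq}++` line does both steps at once: it tests whether the sequence is already in the hash and records it for the following iterations. Because results are pushed onto an array, the original order of the samples is preserved.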

      Hi roboticus

      Thank you for your suggestions. I shall try to fix the problem.

Re: How does one get only the non-redundant (non-repeating) entries with header?
by ww (Archbishop) on Jul 17, 2014 at 12:45 UTC
    Please learn to use Super Search for questions like this. FAQ-tully (or tutorial-ly), use a hash and the input record separator.

    If you do this without checking the hash content (that is, by simply stuffing the latest match into the hash), your output will be the LAST of the duplicated elements for each datum. If you care about getting the first "sample\d", use the technique outlined by roboticus.


    check Ln42!
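
The difference ww points out between overwriting the hash and checking it first can be sketched like this. It is a hedged illustration rather than ww's code: the helper headers_by_sequence and its $keep_first flag are hypothetical, the input is read through an in-memory filehandle so the example is self-contained, and it assumes the word "fun" appears only as the record terminator.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: read records terminated by "fun" and map each
# sequence to a header. With $keep_first false, later headers overwrite
# earlier ones; with it true, the first header for a sequence is kept.
sub headers_by_sequence {
    my ($text, $keep_first) = @_;
    my %header_for;

    open my $fh, '<', \$text or die "Cannot open in-memory handle: $!";
    local $/ = "fun";                  # records end at "fun", not at "\n"
    while (my $record = <$fh>) {
        chomp $record;                 # chomp strips the current $/, i.e. "fun"
        my ($header, $seq) = $record =~ /(>.*?sequence)\s+(\S+)/s or next;
        next if $keep_first && exists $header_for{$seq};
        $header_for{$seq} = $header;
    }
    close $fh;
    return %header_for;
}

my $data = ">sample1 ..sequence ATGC fun >sample2 ..sequence ATGC fun "
         . ">sample3 ..sequence CCGG fun >sample4 ..sequence CCGG fun";

my %last  = headers_by_sequence($data, 0);
my %first = headers_by_sequence($data, 1);

print "overwrite:   $last{$_} $_\n"  for sort keys %last;
print "check first: $first{$_} $_\n" for sort keys %first;
```

With the question's data, the overwrite variant ends up with sample2 and sample4, while the check-first variant keeps sample1 and sample3. Note that a plain hash does not remember insertion order, which is why the keys are sorted before printing.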

      Hi ww

      Thank you for your suggestions. I shall read the material and try to solve the problem. I am sorry for the delayed reply; my internet connectivity is poor.
