Beefy Boxes and Bandwidth Generously Provided by pair Networks
We don't bite newbies here... much

extraction of sequences

by patric (Acolyte)
on Oct 13, 2009 at 11:09 UTC ( #800885=perlquestion: print w/replies, xml ) Need Help??

patric has asked for the wisdom of the Perl Monks concerning the following question:

Dear all, I have a file, where i have to extract specific lines from the file.
# ----- prediction on sequence number 1 (length = 105, name = seq_01) +-- # # Constraints/Hints: # (none) # Predicted genes for sequence number 1 on both strands # start gene g1 seq_01 CHECKED gene 28503 30196 0.89 + . g1 seq_01 CHECKED transcript 28503 30196 0.89 + . + g1.t1 seq_01 CHECKED start_codon 28503 28505 . + 0 t +ranscript_id "g1.t1"; gene_id "g1"; # coding sequence = [atgtcgtccctccccactctcatctttctccaccc # atcgctgcggtcctcgccgacccttttgtgccggaagtagggaccgg] # protein sequence = [MTASAFVLGTVAFLHNRLRRSRPRQASTAHR # GTETPLLRSDKENLTTVLDATILVHSLGQKTNLALGATSSSLDLQKTNLAL # VAALTPGIVFPLPSPFVATGLCLQKTNLALGATSSSLDL] # end gene g1 ### # start gene g2 seq_01 CHECKED gene 77978 79779 0.44 + . g2 seq_01 CHECKED transcript 77978 79779 0.44 + . + g2.t1 seq_01 CHECKED start_codon 77978 77980 . + 0 t +ranscript_id "g2.t1"; gene_id "g2"; # coding sequence = [atgccgtcctcgtcaaagcagctggcgatgcc # tcggcccctccttctgcaaaccgccctgccgcccgcctcggctcctccgaa # gccgagcagcctacgcaggggccgcagatgctcgcgggagggaatatcgg] # protein sequence =[MPLDSSSTPTSNPAPSHSSTAYLLFERLHIAEQ # CCPGQGIRHGKWSPGSSEAPT] # end gene g2 ### # # ----- prediction on sequence number 2 (length = 710, name = seq_02) +----- # # Constraints/Hints: # (none) # Predicted genes for sequence number 2 on both strands # start gene g3 seq_02 CHECKED gene 150 2800 0.31 + . g3 seq_02 CHECKED transcript 150 2800 0.31 + . g3 +.t1 seq_02 CHECKED intron 1 149 0.75 + . transcrip +t_id "g3.t1"; gene_id "g3"; # coding sequence = [agctgccctcctcggggccagccttctcttaactc # tttgagaccttcaatcctgaggcgtgagacgcagtctggaggagcagctc] # protein sequence = [LRRETQSGGAALCSLFDPPPTPTACAHANSP] # end gene g3 ### # # ----- prediction on sequence number 3 (length = 713, name = seq_03) +----- # # Constraints/Hints: # (none) # Predicted genes for sequence number 3 on both strands # start gene g4 .... [as same as above] on and on...
From this file, i need to extract sequences to 2 different files like:
FILE 1: >seq_01 g1 atgtcgtccctccccactctcatctttctccacccatcgctgcggtcctcgccgacccttttgtgccgga +agtagggaccgg >seq_01 g2 atgccgtcctcgtcaaagcagctggcgatgcctcggcccctccttctgcaaaccgccctgccgcccgcct +cggctcctccgaagccgagcagcctacgcaggggccgcagatgctcgcgggagggaatatcgg >seq_02 g3 agctgccctcctcggggccagccttctcttaactctttgagaccttcaatcctgaggcgtgagacgcagt +ctggaggagcagctc >seq_03 g4 on... FILE 2: >seq_01 g1 MTASAFVLGTVAFLHNRLRRSRPRQASTAHRGTETPLLRSDKENLTTVLDATILVHSLGQKTNLALGATS +SSLDLQKTNLALVAALTPGIVFPLPSPFVATGLCLQKTNLALGATSSSLDL >seq_01 g2 MPLDSSSTPTSNPAPSHSSTAYLLFERLHIAEQCCPGQGIRHGKWSPGSSEAPT >seq_02 g3 LRRETQSGGAALCSLFDPPPTPTACAHANSP >seq_03 g4 on...
The code i have written so far to obtain this is:
#!/usr/bin/perl open(FH,$ARGV[0]); open(OUT1,">file1.txt"); open(OUT2,">file2.txt"); @array=<FH>; $str=join("",@array); @list=split("###",$str); foreach $line(@list){ $line=~m/(# coding sequence = [.*\])(# protein sequence = [.*\])/; print OUT1 "$1"; print OUT2 "$2"; }
I am not getting any answer for this program. havent found how to print the headers too. How can i do it? please advice or give suggestions. thank you. :)

Replies are listed 'Best First'.
Re: extraction of sequences
by jethro (Monsignor) on Oct 13, 2009 at 11:31 UTC

    You escape the closing brace but not the opening brace in your regexp. You should escape both

    Also there is a '\n' between coding sequence and protein sequence you seem to have overlooked. But since you don't look for lines anyway it might make sense to do a 'chomp @array' before you join the lines (instead of adapting the regex).

    You might also use non-greedy .*? instead of .* to make your regex a bit faster.

    Note that the whole string in parenthesis is captured in $1,$2.... If you only want the coding and protein sequences without the text surrounding it you have to shrink the parenthesis to sit just around the .*

    General advice: Please add at least 'use warnings;' to your code, and 'use strict;' is recommended too

Re: extraction of sequences
by BioLion (Curate) on Oct 13, 2009 at 12:17 UTC

    If the above comments don't get you what you want, you might check out and especially the Bio::Seq realted modules. These are designed to do pretty much exactly what you want to do, i.e. munging one formatted output into another. Hope this helps

    Just a something something...
Re: extraction of sequences
by arun_kom (Monk) on Oct 13, 2009 at 14:00 UTC

    The file you want to parse seems to be in GFF format. In that case, you could maybe try the GFF parser from BioPerl ... Bio::Tools::GFF

Re: extraction of sequences
by biohisham (Priest) on Oct 13, 2009 at 23:14 UTC
    I am not familiar with the sequence format presented here to know if BioPerl modules such as Bio::IOSeq can handle it as it is, hence its a good idea to clean up the file first and extract only the needed data, unfortunately after having done just that , I got stuck, I feel I may need to use a data structure like hashes of arrays but I can not land on what exactly I have to do to realize this. Also, I have manually removed the part from your sample file that goes like:
    # ----- prediction on sequence number 3 (length = 713, name = seq_03) +----- # # Constraints/Hints: # (none) # Predicted genes for sequence number 3 on both strands # start gene g4 .... [as same as above] on and on...
    However, following is my initial take at it, I hope a wiser monk than myself can land it at its destination so we can learn something new:
    #!/usr/local/bin/perl #Title "extraction of sequences" #saved the sample file in bioinfo.txt use strict; use warnings; use IO::File; my $handle = new IO::File; $handle->autoflush(1); $handle->open("<bioinfo.txt") or die("$!"); my @input_array; my @new_array; @input_array=<$handle>; @input_array = grep {s/#//g} @input_array; for (my $i=0; $i<$#input_array; $i++){ chomp $input_array[$i]; delete $input_array[$i] if $input_array[$i]=~ /((none)|checked +|constraints|predicted)/i; #shedding extras next unless $input_array[$i]; #ignoring empty lines. push @new_array, $input_array[$i]; #capturing the element +s that I need } for(my $i=0;$i<$#new_array;$i++){ print "$i-$new_array[$i]\n"; #preparing for further pr +ocessing }

    Here is the output from the snippet above:
    0- ----- prediction on sequence number 1 (length = 105, name = seq_01) + -- 1- start gene g1 2- coding sequence = [atgtcgtccctccccactctcatctttctccaccc 3- atcgctgcggtcctcgccgacccttttgtgccggaagtagggaccgg] 4- protein sequence = [MTASAFVLGTVAFLHNRLRRSRPRQASTAHR 5- GTETPLLRSDKENLTTVLDATILVHSLGQKTNLALGATSSSLDLQKTNLAL 6- VAALTPGIVFPLPSPFVATGLCLQKTNLALGATSSSLDL] 7- end gene g1 8- start gene g2 9- coding sequence = [atgccgtcctcgtcaaagcagctggcgatgcc 10- tcggcccctccttctgcaaaccgccctgccgcccgcctcggctcctccgaa 11- gccgagcagcctacgcaggggccgcagatgctcgcgggagggaatatcgg] 12- protein sequence =[MPLDSSSTPTSNPAPSHSSTAYLLFERLHIAEQ 13- CCPGQGIRHGKWSPGSSEAPT] 14- end gene g2 15- ----- prediction on sequence number 2 (length = 710, name = seq_02 +) ----- 16- start gene g3 17- coding sequence = [agctgccctcctcggggccagccttctcttaactc 18- tttgagaccttcaatcctgaggcgtgagacgcagtctggaggagcagctc] 19- protein sequence = [LRRETQSGGAALCSLFDPPPTPTACAHANSP]
    Out of curiosity I have translated the gene sequence and also backtranslated the proteins on ExPasy but they're giving out different results than what is in the sample file you provided and are unrelated. Any clues ?

    Excellence is an Endeavor of Persistence. Chance Favors a Prepared Mind.
      hi.sorry for disappointing you. The DNA sequence and the protein sequences are different here. the actual file is really huge and has long sequences. i have just copied some random sequences here for an example. The real data of the DNA do code for its corresponding proteins. Well, anyways, thanks for trying out. thanks for the code. i will try extracting out the sequences to seperate files now. :)

Log In?

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: perlquestion [id://800885]
Approved by Corion
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others browsing the Monastery: (6)
As of 2021-12-09 10:56 GMT
Find Nodes?
    Voting Booth?
    R or B?

    Results (36 votes). Check out past polls.