http://qs321.pair.com?node_id=800885

patric has asked for the wisdom of the Perl Monks concerning the following question:

Dear all, I have a file from which I need to extract specific lines:
# ----- prediction on sequence number 1 (length = 105, name = seq_01) --
#
# Constraints/Hints:
# (none)
# Predicted genes for sequence number 1 on both strands
# start gene g1
seq_01 CHECKED gene 28503 30196 0.89 + . g1
seq_01 CHECKED transcript 28503 30196 0.89 + . g1.t1
seq_01 CHECKED start_codon 28503 28505 . + 0 transcript_id "g1.t1"; gene_id "g1";
# coding sequence = [atgtcgtccctccccactctcatctttctccaccc
# atcgctgcggtcctcgccgacccttttgtgccggaagtagggaccgg]
# protein sequence = [MTASAFVLGTVAFLHNRLRRSRPRQASTAHR
# GTETPLLRSDKENLTTVLDATILVHSLGQKTNLALGATSSSLDLQKTNLAL
# VAALTPGIVFPLPSPFVATGLCLQKTNLALGATSSSLDL]
# end gene g1
###
# start gene g2
seq_01 CHECKED gene 77978 79779 0.44 + . g2
seq_01 CHECKED transcript 77978 79779 0.44 + . g2.t1
seq_01 CHECKED start_codon 77978 77980 . + 0 transcript_id "g2.t1"; gene_id "g2";
# coding sequence = [atgccgtcctcgtcaaagcagctggcgatgcc
# tcggcccctccttctgcaaaccgccctgccgcccgcctcggctcctccgaa
# gccgagcagcctacgcaggggccgcagatgctcgcgggagggaatatcgg]
# protein sequence =[MPLDSSSTPTSNPAPSHSSTAYLLFERLHIAEQ
# CCPGQGIRHGKWSPGSSEAPT]
# end gene g2
###
#
# ----- prediction on sequence number 2 (length = 710, name = seq_02) -----
#
# Constraints/Hints:
# (none)
# Predicted genes for sequence number 2 on both strands
# start gene g3
seq_02 CHECKED gene 150 2800 0.31 + . g3
seq_02 CHECKED transcript 150 2800 0.31 + . g3.t1
seq_02 CHECKED intron 1 149 0.75 + . transcript_id "g3.t1"; gene_id "g3";
# coding sequence = [agctgccctcctcggggccagccttctcttaactc
# tttgagaccttcaatcctgaggcgtgagacgcagtctggaggagcagctc]
# protein sequence = [LRRETQSGGAALCSLFDPPPTPTACAHANSP]
# end gene g3
###
#
# ----- prediction on sequence number 3 (length = 713, name = seq_03) -----
#
# Constraints/Hints:
# (none)
# Predicted genes for sequence number 3 on both strands
# start gene g4
.... [as same as above]......so on and on...
From this file, I need to extract the sequences into two different files, like this:
FILE 1:
>seq_01 g1
atgtcgtccctccccactctcatctttctccacccatcgctgcggtcctcgccgacccttttgtgccggaagtagggaccgg
>seq_01 g2
atgccgtcctcgtcaaagcagctggcgatgcctcggcccctccttctgcaaaccgccctgccgcccgcctcggctcctccgaagccgagcagcctacgcaggggccgcagatgctcgcgggagggaatatcgg
>seq_02 g3
agctgccctcctcggggccagccttctcttaactctttgagaccttcaatcctgaggcgtgagacgcagtctggaggagcagctc
>seq_03 g4
......so on...

FILE 2:
>seq_01 g1
MTASAFVLGTVAFLHNRLRRSRPRQASTAHRGTETPLLRSDKENLTTVLDATILVHSLGQKTNLALGATSSSLDLQKTNLALVAALTPGIVFPLPSPFVATGLCLQKTNLALGATSSSLDL
>seq_01 g2
MPLDSSSTPTSNPAPSHSSTAYLLFERLHIAEQCCPGQGIRHGKWSPGSSEAPT
>seq_02 g3
LRRETQSGGAALCSLFDPPPTPTACAHANSP
>seq_03 g4
......so on...
The code I have written so far to do this is:
#!/usr/bin/perl
open(FH,$ARGV[0]);
open(OUT1,">file1.txt");
open(OUT2,">file2.txt");
@array=<FH>;
$str=join("",@array);
@list=split("###",$str);
foreach $line(@list){
    $line=~m/(# coding sequence = [.*\])(# protein sequence = [.*\])/;
    print OUT1 "$1";
    print OUT2 "$2";
}
I am not getting any output from this program, and I haven't figured out how to print the headers either. How can I do it? Please advise or give suggestions. Thank you. :)
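
One possible approach, offered as a sketch rather than a definitive answer. As posted, the regex leaves both `[` characters unescaped, so Perl reads them as the start of a character class instead of a literal bracket. In addition, `.` does not match a newline without the `/s` modifier, and the sequences span several lines. The sketch below sidesteps both issues by reading the input line by line and tracking which block it is in. The sub name `parse_predictions` and the hash keys are made up for illustration, and the output layout (a `>name gene` header line followed by the sequence on one line) is an assumption based on the desired output shown in the question.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: takes the whole prediction text and returns a list of
# hashrefs with the keys seq, gene, coding and protein.
sub parse_predictions {
    my ($text) = @_;
    my @records;
    my $seq = 'unknown';       # name from the most recent "prediction on sequence" header
    my ($gene, $field, %part); # current gene, open field (coding/protein), collected parts

    for my $line (split /\r?\n/, $text) {
        my $chunk;
        if ($line =~ /^# ----- prediction on sequence number \d+ \(.*name = ([^)\s]+)\)/) {
            $seq = $1;
        }
        elsif ($line =~ /^# start gene (\S+)/) {
            $gene  = $1;
            $field = undef;
            %part  = ();
        }
        elsif ($line =~ /^# (coding|protein) sequence\s*=\s*\[(.*)/) {
            # First line of a bracketed sequence; note the escaped "[".
            ($field, $chunk) = ($1, $2);
        }
        elsif (defined $field && $line =~ /^#\s*(.*)/) {
            # Continuation line of the sequence that is still open.
            $chunk = $1;
        }
        elsif (defined $gene && $line =~ /^# end gene/) {
            push @records, {
                seq     => $seq,
                gene    => $gene,
                coding  => $part{coding},
                protein => $part{protein},
            };
            $gene = undef;
        }

        if (defined $chunk) {
            my $closed = ($chunk =~ s/\].*//);  # a "]" ends the sequence
            $chunk =~ s/\s+//g;
            $part{$field} .= $chunk;
            $field = undef if $closed;
        }
    }
    return @records;
}

# Only touch files when an input file name is given on the command line.
if (@ARGV) {
    my $infile = shift @ARGV;
    open my $in, '<', $infile or die "Cannot open $infile: $!";
    my $text = do { local $/; <$in> };  # slurp the whole file
    close $in;

    open my $cds_out,  '>', 'file1.txt' or die "Cannot write file1.txt: $!";
    open my $prot_out, '>', 'file2.txt' or die "Cannot write file2.txt: $!";
    for my $r (parse_predictions($text)) {
        print {$cds_out}  ">$r->{seq} $r->{gene}\n$r->{coding}\n"
            if defined $r->{coding};
        print {$prot_out} ">$r->{seq} $r->{gene}\n$r->{protein}\n"
            if defined $r->{protein};
    }
    close $cds_out;
    close $prot_out;
}
```

Saved under a name of your choosing, it would be run with the prediction file as its only argument. Checking the return value of every `open` (the `or die` part) is worth keeping in any version, because a failed `open` is otherwise silent and also produces empty output files.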