extraction of sequences

patric has asked for the wisdom of the Perl Monks concerning the following question:

Dear all, I have a file, where i have to extract specific lines from the file.

# ----- prediction on sequence number 1 (length = 105, name = seq_01) 
+--
#
# Constraints/Hints:
# (none)
# Predicted genes for sequence number 1 on both strands
# start gene g1
seq_01    CHECKED    gene    28503    30196    0.89    +    .    g1
seq_01    CHECKED    transcript    28503    30196    0.89    +    .   
+ g1.t1
seq_01    CHECKED    start_codon    28503    28505    .    +    0    t
+ranscript_id "g1.t1"; gene_id "g1";
# coding sequence = [atgtcgtccctccccactctcatctttctccaccc
# atcgctgcggtcctcgccgacccttttgtgccggaagtagggaccgg]
# protein sequence = [MTASAFVLGTVAFLHNRLRRSRPRQASTAHR
# GTETPLLRSDKENLTTVLDATILVHSLGQKTNLALGATSSSLDLQKTNLAL
# VAALTPGIVFPLPSPFVATGLCLQKTNLALGATSSSLDL]
# end gene g1
###
# start gene g2
seq_01    CHECKED    gene    77978    79779    0.44    +    .    g2
seq_01    CHECKED    transcript    77978    79779    0.44    +    .   
+ g2.t1
seq_01    CHECKED    start_codon    77978    77980    .    +    0    t
+ranscript_id "g2.t1"; gene_id "g2";
# coding sequence = [atgccgtcctcgtcaaagcagctggcgatgcc
# tcggcccctccttctgcaaaccgccctgccgcccgcctcggctcctccgaa
# gccgagcagcctacgcaggggccgcagatgctcgcgggagggaatatcgg]
# protein sequence =[MPLDSSSTPTSNPAPSHSSTAYLLFERLHIAEQ
# CCPGQGIRHGKWSPGSSEAPT]
# end gene g2
###
#
# ----- prediction on sequence number 2 (length = 710, name = seq_02) 
+-----
#
# Constraints/Hints:
# (none)
# Predicted genes for sequence number 2 on both strands
# start gene g3
seq_02    CHECKED    gene    150    2800    0.31    +    .    g3
seq_02    CHECKED    transcript    150    2800    0.31    +    .    g3
+.t1
seq_02    CHECKED    intron    1    149    0.75    +    .    transcrip
+t_id "g3.t1"; gene_id "g3";
# coding sequence = [agctgccctcctcggggccagccttctcttaactc
# tttgagaccttcaatcctgaggcgtgagacgcagtctggaggagcagctc]
# protein sequence = [LRRETQSGGAALCSLFDPPPTPTACAHANSP]
# end gene g3
###
#
# ----- prediction on sequence number 3 (length = 713, name = seq_03) 
+-----
#
# Constraints/Hints:
# (none)
# Predicted genes for sequence number 3 on both strands
# start gene g4 ....
[as same as above]......so on and on...
[download]

From this file, i need to extract sequences to 2 different files like:

FILE 1:
>seq_01 g1
atgtcgtccctccccactctcatctttctccacccatcgctgcggtcctcgccgacccttttgtgccgga
+agtagggaccgg
>seq_01 g2
atgccgtcctcgtcaaagcagctggcgatgcctcggcccctccttctgcaaaccgccctgccgcccgcct
+cggctcctccgaagccgagcagcctacgcaggggccgcagatgctcgcgggagggaatatcgg
>seq_02 g3
agctgccctcctcggggccagccttctcttaactctttgagaccttcaatcctgaggcgtgagacgcagt
+ctggaggagcagctc
>seq_03 g4
......so on...

FILE 2:
>seq_01 g1
MTASAFVLGTVAFLHNRLRRSRPRQASTAHRGTETPLLRSDKENLTTVLDATILVHSLGQKTNLALGATS
+SSLDLQKTNLALVAALTPGIVFPLPSPFVATGLCLQKTNLALGATSSSLDL
>seq_01 g2
MPLDSSSTPTSNPAPSHSSTAYLLFERLHIAEQCCPGQGIRHGKWSPGSSEAPT
>seq_02 g3
LRRETQSGGAALCSLFDPPPTPTACAHANSP
>seq_03 g4
......so on...
[download]

The code i have written so far to obtain this is:

#!/usr/bin/perl
open(FH,$ARGV[0]);
open(OUT1,">file1.txt");
open(OUT2,">file2.txt");
@array=<FH>;
$str=join("",@array);
@list=split("###",$str);
foreach $line(@list){
    $line=~m/(# coding sequence = [.*\])(# protein sequence = [.*\])/;
  print OUT1 "$1";
  print OUT2 "$2";
}
[download]

I am not getting any answer for this program. havent found how to print the headers too. How can i do it? please advice or give suggestions. thank you. :)

Comment on extraction of sequences Select or Download Code

Replies are listed 'Best First'.
Re: extraction of sequences by jethro (Monsignor) on Oct 13, 2009 at 11:31 UTC
You escape the closing brace but not the opening brace in your regexp. You should escape both Also there is a '\n' between coding sequence and protein sequence you seem to have overlooked. But since you don't look for lines anyway it might make sense to do a 'chomp @array' before you join the lines (instead of adapting the regex). You might also use non-greedy `.?` instead of `.` to make your regex a bit faster. Note that the whole string in parenthesis is captured in $1,$2.... If you only want the coding and protein sequences without the text surrounding it you have to shrink the parenthesis to sit just around the `.*` General advice: Please add at least 'use warnings;' to your code, and 'use strict;' is recommended too	[reply] [d/l] [select]
Re: extraction of sequences by BioLion (Curate) on Oct 13, 2009 at 12:17 UTC
If the above comments don't get you what you want, you might check out http://www.bioperl.org and especially the Bio::Seq realted modules. These are designed to do pretty much exactly what you want to do, i.e. munging one formatted output into another. Hope this helps Just a something something...	[reply]
Re: extraction of sequences by arun_kom (Monk) on Oct 13, 2009 at 14:00 UTC
The file you want to parse seems to be in GFF format. In that case, you could maybe try the GFF parser from BioPerl ... Bio::Tools::GFF	[reply]
Re: extraction of sequences by biohisham (Priest) on Oct 13, 2009 at 23:14 UTC
I am not familiar with the sequence format presented here to know if BioPerl modules such as `Bio::IOSeq` can handle it as it is, hence its a good idea to clean up the file first and extract only the needed data, unfortunately after having done just that , I got stuck, I feel I may need to use a data structure like hashes of arrays but I can not land on what exactly I have to do to realize this. Also, I have manually removed the part from your sample file that goes like: `# ----- prediction on sequence number 3 (length = 713, name = seq_03) +----- # # Constraints/Hints: # (none) # Predicted genes for sequence number 3 on both strands # start gene g4 .... [as same as above]......so on and on...` [download] However, following is my initial take at it, I hope a wiser monk than myself can land it at its destination so we can learn something new: #!/usr/local/bin/perl #Title "extraction of sequences" #saved the sample file in bioinfo.txt use strict; use warnings; use IO::File; my $handle = new IO::File; $handle->autoflush(1); $handle->open("<bioinfo.txt") or die("$!"); my @input_array; my @new_array; @input_array=<$handle>; @input_array = grep {s/#//g} @input_array; for (my $i=0; $i<$#input_array; $i++){ chomp $input_array[$i]; delete $input_array[$i] if $input_array[$i]=~ /((none)\|checked +\|constraints\|predicted)/i; #shedding extras next unless $input_array[$i]; #ignoring empty lines. push @new_array, $input_array[$i]; #capturing the element +s that I need } for(my $i=0;$i<$#new_array;$i++){ print "$i-$new_array[$i]\n"; #preparing for further pr +ocessing } [download] Here is the output from the snippet above: 0- ----- prediction on sequence number 1 (length = 105, name = seq_01) + -- 1- start gene g1 2- coding sequence = [atgtcgtccctccccactctcatctttctccaccc 3- atcgctgcggtcctcgccgacccttttgtgccggaagtagggaccgg] 4- protein sequence = [MTASAFVLGTVAFLHNRLRRSRPRQASTAHR 5- GTETPLLRSDKENLTTVLDATILVHSLGQKTNLALGATSSSLDLQKTNLAL 6- VAALTPGIVFPLPSPFVATGLCLQKTNLALGATSSSLDL] 7- end gene g1 8- start gene g2 9- coding sequence = [atgccgtcctcgtcaaagcagctggcgatgcc 10- tcggcccctccttctgcaaaccgccctgccgcccgcctcggctcctccgaa 11- gccgagcagcctacgcaggggccgcagatgctcgcgggagggaatatcgg] 12- protein sequence =[MPLDSSSTPTSNPAPSHSSTAYLLFERLHIAEQ 13- CCPGQGIRHGKWSPGSSEAPT] 14- end gene g2 15- ----- prediction on sequence number 2 (length = 710, name = seq_02 +) ----- 16- start gene g3 17- coding sequence = [agctgccctcctcggggccagccttctcttaactc 18- tttgagaccttcaatcctgaggcgtgagacgcagtctggaggagcagctc] 19- protein sequence = [LRRETQSGGAALCSLFDPPPTPTACAHANSP] [download] Out of curiosity I have translated the gene sequence and also backtranslated the proteins on ExPasy but they're giving out different results than what is in the sample file you provided and are unrelated. Any clues ? Excellence is an Endeavor of Persistence. Chance Favors a Prepared Mind.	[reply] [d/l] [select]
Re^2: extraction of sequences by patric (Acolyte) on Oct 14, 2009 at 03:07 UTC
hi.sorry for disappointing you. The DNA sequence and the protein sequences are different here. the actual file is really huge and has long sequences. i have just copied some random sequences here for an example. The real data of the DNA do code for its corresponding proteins. Well, anyways, thanks for trying out. thanks for the code. i will try extracting out the sequences to seperate files now. :)	[reply]


Perl Monk, Perl Meditation
	PerlMonks