PerlMonks  

Re: extraction of sequences

by biohisham (Priest)
on Oct 13, 2009 at 23:14 UTC ( [id://800990]=note )


in reply to extraction of sequences

I am not familiar enough with the sequence format presented here to know whether BioPerl modules such as Bio::SeqIO can handle it as it is, so it seems a good idea to clean up the file first and extract only the needed data. Unfortunately, after having done just that, I got stuck. I feel I may need a data structure like a hash of arrays, but I cannot work out exactly what I have to do to realize this. Also, I manually removed the part of your sample file that goes like:
# ----- prediction on sequence number 3 (length = 713, name = seq_03) -----
#
# Constraints/Hints:
# (none)
# Predicted genes for sequence number 3 on both strands
# start gene g4
.... [as same as above]......so on and on...
However, here is my initial take on it. I hope a wiser monk than myself can land it at its destination so we can learn something new:
#!/usr/local/bin/perl
# Title: "extraction of sequences"
# The sample file was saved as bioinfo.txt
use strict;
use warnings;
use IO::File;

my $handle = IO::File->new("bioinfo.txt", "r")
    or die "Cannot open bioinfo.txt: $!";
my @input_array = <$handle>;
$handle->close;

my @new_array;
for my $line (@input_array) {
    chomp $line;
    # keep only the lines that carry a '#' marker, with the marker stripped
    next unless $line =~ s/#//g;
    # shedding extras
    next if $line =~ /none|checked|constraints|predicted/i;
    # ignoring empty lines
    next unless $line =~ /\S/;
    # capturing the elements that I need
    push @new_array, $line;
}

# preparing for further processing
for my $i (0 .. $#new_array) {
    print "$i-$new_array[$i]\n";
}

Here is the output from the snippet above:
0- ----- prediction on sequence number 1 (length = 105, name = seq_01) --
1- start gene g1
2- coding sequence = [atgtcgtccctccccactctcatctttctccaccc
3- atcgctgcggtcctcgccgacccttttgtgccggaagtagggaccgg]
4- protein sequence = [MTASAFVLGTVAFLHNRLRRSRPRQASTAHR
5- GTETPLLRSDKENLTTVLDATILVHSLGQKTNLALGATSSSLDLQKTNLAL
6- VAALTPGIVFPLPSPFVATGLCLQKTNLALGATSSSLDL]
7- end gene g1
8- start gene g2
9- coding sequence = [atgccgtcctcgtcaaagcagctggcgatgcc
10- tcggcccctccttctgcaaaccgccctgccgcccgcctcggctcctccgaa
11- gccgagcagcctacgcaggggccgcagatgctcgcgggagggaatatcgg]
12- protein sequence =[MPLDSSSTPTSNPAPSHSSTAYLLFERLHIAEQ
13- CCPGQGIRHGKWSPGSSEAPT]
14- end gene g2
15- ----- prediction on sequence number 2 (length = 710, name = seq_02) -----
16- start gene g3
17- coding sequence = [agctgccctcctcggggccagccttctcttaactc
18- tttgagaccttcaatcctgaggcgtgagacgcagtctggaggagcagctc]
19- protein sequence = [LRRETQSGGAALCSLFDPPPTPTACAHANSP]
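For the data-structure step, one possible direction is a hash of hashes keyed by gene name rather than a hash of arrays. The following is only a minimal sketch under assumptions: parse_genes is a hypothetical helper, the field names coding and protein are my own choice, and the only layout it knows is the one visible in the cleaned output (a "start gene" line, bracketed sequences that may wrap over several lines, and an "end gene" line).

```perl
use strict;
use warnings;

# Hypothetical helper: group cleaned lines into
# { gene_name => { coding => '...', protein => '...' } }.
sub parse_genes {
    my @lines = @_;
    my (%genes, $gene, $field);
    for my $line (@lines) {
        if ($line =~ /start gene (\S+)/) {
            ($gene, $field) = ($1, undef);
            next;
        }
        if ($line =~ /end gene/) {
            ($gene, $field) = (undef, undef);
            next;
        }
        next unless defined $gene;          # ignore anything outside a gene block

        my $text = $line;
        # A "coding sequence = [" or "protein sequence = [" prefix opens a field.
        if ($text =~ s/^\s*(coding|protein) sequence\s*=\s*\[//) {
            $field = $1;
            $genes{$gene}{$field} = '';
        }
        next unless defined $field;

        my $closed = ($text =~ /\]/);       # a closing bracket ends the field
        $text =~ s/[\s\]]//g;               # drop whitespace and the bracket
        $genes{$gene}{$field} .= $text;
        $field = undef if $closed;
    }
    return \%genes;
}

# Usage with the g1 lines from the output above.
my @sample = (
    ' start gene g1',
    ' coding sequence = [atgtcgtccctccccactctcatctttctccaccc',
    ' atcgctgcggtcctcgccgacccttttgtgccggaagtagggaccgg]',
    ' protein sequence = [MTASAFVLGTVAFLHNRLRRSRPRQASTAHR',
    ' GTETPLLRSDKENLTTVLDATILVHSLGQKTNLALGATSSSLDLQKTNLAL',
    ' VAALTPGIVFPLPSPFVATGLCLQKTNLALGATSSSLDL]',
    ' end gene g1',
);
my $genes = parse_genes(@sample);
for my $name (sort keys %$genes) {
    print ">$name\n$genes->{$name}{coding}\n";
}
```

With the sequences joined per gene like this, printing them in a FASTA-like form or writing each one to its own file becomes a loop over the keys.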
Out of curiosity I translated the coding sequences and also back-translated the proteins on ExPASy, but the results differ from what is in the sample file you provided and are unrelated to it. Any clues?
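The mismatch can also be checked locally. This is a rough sketch, assuming the standard genetic code (NCBI translation table 1) and reading frame 1 only; translate_dna is a hypothetical helper, not a BioPerl function, and the X placeholder for unrecognized codons is my own convention.

```perl
use strict;
use warnings;

# Standard genetic code (NCBI translation table 1), with the bases of each
# codon position ordered t, c, a, g.
my @bases = qw(t c a g);
my $amino = 'FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG';
my %codon_table;
my $index = 0;
for my $b1 (@bases) {
    for my $b2 (@bases) {
        for my $b3 (@bases) {
            $codon_table{"$b1$b2$b3"} = substr($amino, $index++, 1);
        }
    }
}

# Hypothetical helper: translate a DNA string in reading frame 1.
# Codons not in the table (for example ones containing 'n') become 'X';
# a trailing partial codon is ignored.
sub translate_dna {
    my ($dna) = @_;
    $dna = lc $dna;
    $dna =~ s/[^a-z]//g;
    my $protein = '';
    for (my $pos = 0; $pos + 3 <= length($dna); $pos += 3) {
        my $codon = substr($dna, $pos, 3);
        $protein .= exists $codon_table{$codon} ? $codon_table{$codon} : 'X';
    }
    return $protein;
}

# First 15 bases of the g1 coding sequence from the output above.
print translate_dna('atgtcgtccctcccc'), "\n";
```

The first five codons of g1 (atg tcg tcc ctc ccc) translate to MSSLP, whereas the protein listed for g1 starts with MTASA, so the two do not correspond.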

Excellence is an Endeavor of Persistence. Chance Favors a Prepared Mind.

Replies are listed 'Best First'.
Re^2: extraction of sequences
by patric (Acolyte) on Oct 14, 2009 at 03:07 UTC
    Hi. Sorry for disappointing you. The DNA sequences and the protein sequences are different here. The actual file is really huge and has long sequences; I just copied some random sequences here as an example. In the real data the DNA does code for its corresponding proteins. Well, anyway, thanks for trying it out, and thanks for the code. I will try extracting the sequences to separate files now. :)
