First, let me say I know nothing about Orf sequences...
A very quick search of CPAN turned up the bioperl example examples/longorf.pl, which might be worth a look. Again, I don't know whether bioperl is useful, but you might at least get some ideas from looking at some of the code there.
Looking at the fragment of code you have so far... [and it would be easier to do that if (a) it was enclosed in <code> tags, (b) was runnable, (c) had some sample data with it, (d) a description of what was expected, and (e) almost anything that allowed a humble programmer to understand what was required.]
...as far as I can see, you've collected possible start positions in @startsRF1 and stop positions in @stopsRF1 -- these positions are marked by certain 3 character sequences, which are constrained to appear at three character boundaries. Now you want to process stuff between those start and stop positions. Because of the way they've been collected, those arrays are in ascending order of string position, which is a start. Now:
Can what you want to process include one or more start and/or end positions? So, if the starts are (6, 36, 69) and the ends (42, 57, 90), do you want to look at (6..42, 6..57, 6..90, 36..42, 36..57, 36..90, 69..90), or (36..42, 36..57, 69..90), or just (36..42, 69..90)?
Do the start and end of the string count as start and end positions?
Whatever the answers to the above, the simple approach is two foreach loops, the outer cycling through the start positions and the inner the end positions, deciding which start..end combinations to consider. Inside all that you can extract the substring using substr. Then ... I dunno; I regret I don't know what a protein sequence looks like.
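For what it's worth, here is a minimal sketch of the two-loop idea. The positions are the made-up ones from my question above, the DNA string is a placeholder, and I am assuming each stop position marks the first character of a three-character stop sequence; all_pairs is a hypothetical helper name, not anything from your code.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical data: the example positions from the question above,
# already in ascending order.
my @starts = (6, 36, 69);
my @stops  = (42, 57, 90);

# Return every [start, stop] pair in which the stop lies after the start.
sub all_pairs {
    my ($starts, $stops) = @_;
    my @pairs;
    foreach my $start (@$starts) {
        foreach my $stop (@$stops) {
            next if $stop <= $start;    # a stop must come after its start
            push @pairs, [$start, $stop];
        }
    }
    return @pairs;
}

my $dna = 'ACGT' x 25;    # placeholder 100-character string, not real data
foreach my $pair (all_pairs(\@starts, \@stops)) {
    my ($start, $stop) = @$pair;
    # Assumption: include the 3-character stop sequence in the fragment.
    my $fragment = substr $dna, $start, $stop - $start + 3;
    print "$start..$stop: $fragment\n";
}
```

Tightening the `next if` test is where you would encode your answers to the questions above.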
If you have huge numbers of start and end positions, and depending on the answers to the above, you may want a more cunning approach, to speed things up. What I have suggested above is O(n^2), which is fine for little problems, and (frankly) horrible for big ones. But, never optimise until you have to -- and even then, think twice.
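As one example of a more cunning approach: if it turns out you only want each start paired with the nearest stop after it (that is an assumption about your answers, not something I know), the fact that both arrays are sorted lets you walk them together in a single pass. A sketch, with nearest_stops being a name I made up:

```perl
use strict;
use warnings;

# Pair each start with the first stop that lies after it.
# Both lists must be sorted in ascending order.
sub nearest_stops {
    my ($starts, $stops) = @_;
    my @pairs;
    my $j = 0;    # index into @$stops; it only ever moves forwards
    foreach my $start (@$starts) {
        # Skip stops that are at or before this start.
        $j++ while $j < @$stops && $stops->[$j] <= $start;
        last if $j >= @$stops;    # no stop left after this start
        push @pairs, [$start, $stops->[$j]];
    }
    return @pairs;
}

my @pairs = nearest_stops([6, 36, 69], [42, 57, 90]);
print join(', ', map { "$_->[0]..$_->[1]" } @pairs), "\n";
# prints: 6..42, 36..42, 69..90
```

Because the stop index never goes backwards, the work is proportional to the number of starts plus the number of stops rather than their product.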
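#!/usr/bin/perl
use strict;
use warnings;

my (@codons) = qw(ATG GTG);
my $dna = "AAAATGGGGTAAGTGAACGGGTAA";

# Build an alternation of the codons and split on it, keeping the codons
my $splitter  = join('|', @codons);
my @sequences = split /($splitter)/, $dna;
shift @sequences;    # discard whatever precedes the first codon

# The list now alternates: codon, following text, codon, following text ...
my $codon = 1;
foreach (@sequences) {
    if ($codon) {
        print $_, "-";
    }
    else {
        print $_, "\n";
    }
    $codon = not $codon;
}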
Output:
ATG-GGGTAA
GTG-AACGGGTAA
It works by splitting at the codons while capturing them, then discarding the first (possibly empty) element of split's output and putting together every two remaining elements.
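To make the capturing behaviour visible, here is a small sketch that prints the raw list split returns and then glues the pairs into an array instead of printing them. The @pieces and @orfs names are just for illustration.

```perl
use strict;
use warnings;

my $dna    = "AAAATGGGGTAAGTGAACGGGTAA";
my @pieces = split /(ATG|GTG)/, $dna;

# With a capturing group in the pattern, split returns the separators
# as well as the text between them.
print join(' | ', @pieces), "\n";
# prints: AAA | ATG | GGGTAA | GTG | AACGGGTAA

shift @pieces;    # drop the text before the first codon

# Take the remaining elements two at a time: codon, then the text after it.
my @orfs;
while (my ($codon, $tail) = splice @pieces, 0, 2) {
    push @orfs, $codon . ($tail // '');    # $tail is undef if the string ends on a codon
}
print "$_\n" for @orfs;
```

Note that this, like the code above, pays no attention to reading frame; it splits wherever the three characters occur.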
s$$([},&%#}/&/]+}%&{})*;#$&&s&&$^X.($'^"%]=\&(|?*{%
+.+=%;.#_}\&"^"-+%*).}%:##%}={~=~:.")&e&&s""`$''`"e
You could do this using a regular expression with two capturing groups, see perlretut and perlre. There are probably lots of modules out there designed for just this sort of thing.
$ perl -le '
> $seq = q{AAAATGGGGTAAGTGAACGGGTAA};
> $start = q{ATG};
> $stop = q{GTG};
> ( $prot1, $prot2 ) =
> $seq =~ m{(${start}[ACGT]*?)(${stop}[ACGT]*)};
> print qq{$prot1\n$prot2\n};'
ATGGGGTAA
GTGAACGGGTAA
$
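If the stop has to be in the same reading frame as the start, the same idea can be sketched with a global match that steps in whole three-character units. This assumes ATG and GTG are the starts (as in the earlier reply) and that TAA, TAG and TGA are the stops; find_orfs is a made-up name, and the loop only reports non-overlapping matches.

```perl
use strict;
use warnings;

# Collect stretches that begin with a start codon and end at the first
# stop codon found in the same reading frame.
sub find_orfs {
    my ($seq) = @_;
    my @orfs;
    while ($seq =~ m{((?:ATG|GTG)(?:[ACGT]{3})*?(?:TAA|TAG|TGA))}g) {
        push @orfs, $1;
    }
    return @orfs;
}

print "$_\n" for find_orfs('AAAATGGGGTAAGTGAACGGGTAA');
# prints:
# ATGGGGTAA
# GTGAACGGGTAA
```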
I hope this is of use.
Cheers, JohnGG