Re: Perl regular expression for amino acid sequence
by Roy Johnson (Monsignor) on Dec 01, 2004 at 20:06 UTC
This will come close, but will fail if the match is followed by extra repetitions.
/(?:(?!(.)\1\1)[QGYN]){3,6}/;
You might have to consider each character separately, which leads to a long ugly string of alternations. The first char matches your character class. The second is either not a repeat, or is a repeat followed by not a repeat. The third is either not a repeat of the second, or a repeat followed by not a repeat.
After that, the pattern is repeated for the 4th and 5th characters, but they're all optional and nested (so if you don't have the 4th char, you don't look for the 5th). The 6th char doesn't need to check for repetitions, because it was checked by the pattern for the 5th char.
while ($seq{$k} =~ /(([QGYN])
((?!\2)[QGYN]|\2(?!\2))
((?!\3)[QGYN]|\3(?!\3))
(?:((?!\4)[QGYN]|\4(?!\4))
(?:((?!\5)[QGYN]|\5(?!\5))
[QGYN]?)?)?)
/xg) {
print "\n$k";
print $1." begins at position ", (pos($seq{$k})-length($s)) , "\n";
}
Update: adjusted to fit OP's code snippet.
Update2: As Ikegami noted (and I noted in responding to a different post), this solution has the problem of looking too far ahead. It won't take the first two characters out of a trio. A working regex-only solution is posted as a reply to this post.
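The "looking too far ahead" symptom is easy to reproduce with the short lookahead pattern from the top of this post. The sketch below is only an illustration, not part of the original answer: it wraps that pattern in an extra capture group (so the backreference becomes \2), and both test strings are made up.

```perl
use strict;
use warnings;

# The short lookahead pattern from above, wrapped in one extra capture
# group so the whole match lands in $1 (the inner backreference is now \2).
my $lookahead = qr/((?:(?!(.)\2\2)[QGYN]){3,6})/;

# Two made-up inputs: one without a triple, one with a trailing triple.
for my $seq ('xxxGNNGxxx', 'xxxGNNNxxx') {
    if ($seq =~ $lookahead) {
        printf "%s: %s at %d\n", $seq, $1, $-[1];
    } else {
        printf "%s: no match\n", $seq;
    }
}
```

The first string should report GNNG at 3. The second should report no match at all, even though GNN would be acceptable: the lookahead refuses to consume the first N of the triple, so the pattern never collects three characters.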
Caution: Contents may have been coded under pressure.
Here's a pure regex solution that works:
use strict;
use warnings;
while(<DATA>) {
print "$_---\n";
my $m;
while (/([QGYN]{2} # First two characters of the desired class
(?: # Followed by the complex expression...
# Lookback at the previous two chars
(?<=(.)(.))
# Check that the next char differs from at least one of them
(?:(?!\2)|(?!\3))
[QGYN] # Then take another of the desired class
){1,4} # ...1 to 4 times
)/gx) {
$m = $1;
printf "---> $m starting at %d\n", pos($_)-length($m);
}
print "=====\n";
}
__DATA__
QYGNGNG
GGGGGNYGNQYNNNQGYQ
QGYNNN
xxxxxxxGNNNxxxxxxxNNNGYGYxxxxxxxGYGYNNNxxxxxxxNNNGNNNxxxxxxx
Build that programmatically.
my $regex = '(([QGYN])';
$regex .= '((?!\\' . $_ . ')[QGYN]|\\' . $_ . '(?!\\' . $_ . '))' for 2 .. 3;
$regex .= '(?:((?!\\' . $_ . ')[QGYN]|\\' . $_ . '(?!\\' . $_ . '))' for 4 .. 5;
$regex .= '[QGYN]?)?)?)';
$regex = qr/$regex/;
while ($seq{$k} =~ /$regex/g)
{
print "\n$k";
print $1." begins at position ", (pos($seq{$k})-length($s)) , "\n";
}
Now, I have no idea what all that does, but it's easily broken apart. :-)
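One way to gain some confidence in a generated pattern is to compare the built string against the hand-written one. The following is a rough sketch under my own assumptions: the names $class, $unit, $built and $by_hand are made up for illustration, and the hand-written pattern is copied from the post above with its /x whitespace removed.

```perl
use strict;
use warnings;

my $class = '[QGYN]';

# One unit for capture group $n: either a character that differs from
# group $n, or a repeat of group $n that is not followed by another repeat.
my $unit = sub { my ($n) = @_; "((?!\\$n)$class|\\$n(?!\\$n))" };

my $built = "(($class)"
          . join('', map { $unit->($_) } 2 .. 3)
          . join('', map { '(?:' . $unit->($_) } 4 .. 5)
          . "$class?)?)?)";

# The hand-written pattern, with the /x whitespace stripped.
my $by_hand = '(([QGYN])'
            . '((?!\2)[QGYN]|\2(?!\2))'
            . '((?!\3)[QGYN]|\3(?!\3))'
            . '(?:((?!\4)[QGYN]|\4(?!\4))'
            . '(?:((?!\5)[QGYN]|\5(?!\5))'
            . '[QGYN]?)?)?)';

print $built eq $by_hand ? "identical\n" : "different\n";
```

If the two strings differ, printing both side by side usually shows the quoting mistake straight away.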
Being right, does not endow the right to be rude; politeness costs nothing. Being unknowing, is not the same as being stupid. Expressing a contrary opinion, whether to the individual or the group, is more often a sign of deeper thought than of cantankerous belligerence. Do not mistake your goals as the only goals; your opinion as the only opinion; your confidence as correctness. Saying you know better is not the same as explaining you know better.
Input 'xxxxxxxGNNNxxxxxxxNNNGYGYxxxxxxxGYGYNNNxxxxxxxNNNGNNNxxxxxxx' gives:
NNGYGY begins at position 19
GYGYNN begins at position 32
NNGN begins at position 47
rather than
GNN begins at position 7 <---
NNGYGY begins at position 19
GYGYNN begins at position 32
NNGNN begins at position 47 <---
Re: Perl regular expression for amino acid sequence
by dragonchild (Archbishop) on Dec 01, 2004 at 19:49 UTC
/[QGYN]{3,6}/ && !/(.)(?=\1\1)/
Or something like that. Instead of making it one regex, what's wrong with making it two regexes?
Ah, I should have explained more. I need to know what the found pattern was, and I need to know if it's repeated, which is why two regexps won't work, I'm afraid. Here's my full code, which includes your addition:
while ($seq{$k} =~ /([QGYN]{3,6})/g) {
print "\n$k";
print $1." begins at position ", (pos($seq{$k})-length($s)) , "\n";
}
Thanks
Sam
while ($seq{$k} =~ /([QGYN]{3,6})/g) {
my $seq = $1;
next if $seq =~ /(.)\1\1/;
print "\n$k";
print "$seq begins at position ", (pos($seq{$k})-length($s)) , "\n";
}
If this works for you, we could even optimize and consolidate this code a bit. I don't know where $s comes from, but I assume its length isn't changing.
my $length = length $s; # Pull this out of the loop for eff.
my $sequence = $seq{$k};
while ($sequence =~ /([QGYN]{3,6})/g) {
my $seq = $1;
my $pos = $-[0] - $length; # @- holds the positions of the last match
next if $seq =~ /(.)\1\1/;
print "\n$k $seq begins at position $pos\n";
}
update: that was supposed to be print, not printf
Note that this is untested...
Ted Young
($$<<$$=>$$<=>$$<=$$>>$$) always returns 1. :-)
I need to know what the found pattern was and I need to know if it's repeated which is why two regexp wont work I'm afraid.
Two regexes will work just fine. Use the first to do coarse filtering, and the second to refine the result.
while ($seq{$k} =~ /([QGYN]{3,6})/g) {
next if $1 =~ m/QQQ|GGG|YYY|NNN/;
print "\n$k";
print $1." begins at position ", (pos($seq{$k})-length($s)) , "\n";
}
This has the benefit of being blindingly obvious about what you're doing.
Oops: ikegami is correct. This is blindingly wrong.
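To see why, here is a small sketch of the filter-and-skip loop above, wrapped in a helper so the results can be inspected. The helper name filter_scan and both input strings are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical wrapper around the loop above: returns the candidates
# that survive the triple-repeat filter.
sub filter_scan {
    my ($str) = @_;
    my @kept;
    while ($str =~ /([QGYN]{3,6})/g) {
        my $hit = $1;                            # copy before the next match
        next if $hit =~ /QQQ|GGG|YYY|NNN/;       # drop the whole candidate
        push @kept, $hit;
    }
    return @kept;
}

for my $input ('xxGYGYxx', 'xxNNNGYGYxx') {
    my @kept = filter_scan($input);
    printf "%s: %s\n", $input, @kept ? "@kept" : '(nothing)';
}
```

The first string yields GYGY as expected. The second yields nothing: the greedy candidate NNNGYG is thrown away as a whole, and the leftover Y is too short to match, so the valid NNGYGY is never reported.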
This solution is actually fairly wrong, since it first tries to take characters away from the front of the match instead of trying to shorten it. Of course, that only matters if QGNNNG should be treated as a series of two valid amino acid sequences, QGN and NNG.
my $cur;
while ($seq{$k} =~ /([QGYN]{3,6})/g) {
$cur = $1;
pos($seq{$k}) -= length($cur) - 1 and next if $cur =~ /(.)\1\1/;
print "\n$k";
print $cur." begins at position ", (pos($seq{$k})-length($s)) , "\n";
}
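The QGNNNG case can be traced with a sketch like the one below. It is an illustration only: the helper name rewind_scan is made up, and it uses $-[1] for the start offset instead of the pos/length arithmetic, which should amount to the same rewind (one character past the start of the rejected candidate).

```perl
use strict;
use warnings;

# Hypothetical wrapper around the rewinding loop above.
# Returns [match, start] pairs.
sub rewind_scan {
    my ($str) = @_;
    my @found;
    while ($str =~ /([QGYN]{3,6})/g) {
        my ($cur, $start) = ($1, $-[1]);   # save before the inner match
        if ($cur =~ /(.)\1\1/) {
            # Candidate contains a triple: retry one character further on.
            pos($str) = $start + 1;
            next;
        }
        push @found, [$cur, $start];
    }
    return @found;
}

printf "%s at %d\n", @$_ for rewind_scan('QGNNNG');
```

On QGNNNG this should report only NNG at 3. The candidates QGNNNG, GNNNG and NNNG are rejected in turn, each time dropping a character from the front, so QGN is never considered.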
Re: Perl regular expression for amino acid sequence
by BrowserUk (Patriarch) on Dec 01, 2004 at 20:39 UTC
#! perl -slw
use strict;
my $s = 'XXQQGGYYNNQGYNNNNQNGGNGGNGGGQQQNNN';
print $s;
print ' ' x( pos( $s ) - length( $1) ), $1
while $s =~ m[
( ## Capture to $1
(?: ## A group
([QGYN]) ## of these characters
(?!\2{2}) ## repeated no more than twice in succession
){3,6} ## 3 to 6 characters in length
? ## Remove for greedy matching.
)
]xg;
## Condensed and greedy
print $s;
print ' ' x( pos( $s ) - length( $1) ), $1
while $s =~ m[( (?: ([QGYN]) (?!\2{2}) ){3,6} ) ]xg;
__END__
[20:37:58.32] P:\test>temp
XXQQGGYYNNQGYNNNNQNGGNGGNGGGQQQNNN
QQG
GYY
NNQ
NNQ
NGG
NGG
XXQQGGYYNNQGYNNNNQNGGNGGNGGGQQQNNN
QQGGYY
NNQGY
NNQNGG
NGGN
Examine what is said, not who speaks.
"But you should never overestimate the ingenuity of the sceptics to come up with a counter-argument." -Myles Allen
"Think for yourself!" - Abigail
"Time is a poor substitute for thought"--theorbtwo
"Efficiency is intelligent laziness." -David Dunham
"Memory, processor, disk in that order on the hardware side. Algorithm, algorithm, algorithm on the code side." - tachyon
xxxxxxNNNGYGYxxxxxxxGYGYNNNxxxxxxxNNNGNNNxxxxxxx
NNG
YGY
GYG
NNG
xxxxxxNNNGYGYxxxxxxxGYGYNNNxxxxxxxNNNGNNNxxxxxxx
NNGYGY
GYGY
NNG
whereas I would have expected
xxxxxxNNNGYGYxxxxxxxGYGYNNNxxxxxxxNNNGNNNxxxxxxx
NNGYGY
GYGYNN
NNGNN
I didn't read the question that way, but now that you've pointed it out, yours could be, and probably is, the more correct interpretation.
If only the regex engine didn't insist that any reference to a previous capture inside a negative look-behind assertion is variable length (and therefore disallowed), even when the referenced brackets can only ever capture exactly one character, it would be easy to fix this to meet your interpretation of the problem. Alas it does, so it isn't :)
I cannot see a fix at the moment.
Re: Perl regular expression for amino acid sequence
by Thelonius (Priest) on Dec 01, 2004 at 20:37 UTC
Here's one way, which needs a slightly convoluted way of figuring out the original position:
# break up three character repeats, inserting spaces
while ($seq{$k} =~ s/([QGYN])\1\1/$1$1 $1$1/g) { }
while ($seq{$k} =~ m/([QGYN]{3,6})/g) {
print "Match: $1 at ", pos($seq{$k})
- length($1)-2*(substr($seq{$k}, 0, pos($seq{$k})) =~ tr/ / /), "\n";
}
If you already have spaces in your sequences, you'd have to use some other character.
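To make the position arithmetic easier to follow, here is a sketch of the same idea that works on a copy of the string and counts spaces in a separate prefix variable, so pos() on the scanned string is never disturbed. The helper name spaced_scan and the sample input are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the spaced-out working string followed by
# [match, original_start] pairs.
sub spaced_scan {
    my ($orig) = @_;
    my $work = $orig;

    # Break up three-character repeats; each hit adds two characters
    # (one space and one duplicated letter).
    1 while $work =~ s/([QGYN])\1\1/$1$1 $1$1/g;

    my @found;
    while ($work =~ /([QGYN]{3,6})/g) {
        my ($match, $start) = ($1, $-[1]);
        my $prefix = substr($work, 0, $start);   # a copy, not an lvalue
        my $spaces = ($prefix =~ tr/ //);        # counts without modifying
        push @found, [$match, $start - 2 * $spaces];
    }
    return ($work, @found);
}

my ($work, @found) = spaced_scan('xGNNNxNNNGYGY');
print "working string: '$work'\n";
printf "Match: %s at %d\n", @$_ for @found;
```

For this input the working string should be 'xGNN NNxNN NNGYGY', with GNN reported at 1 and NNGYGY at 7, both as offsets into the original string.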
Updated: Changed 5 to 6. I thought the original had a "5", but it was just the tiny fonts on my monitor.
>perl script.pl
Match: GNN at 7
Match: GNN at 7
Match: GNN at 7
Match: GNN at 7
Match: GNN at 7
Match: GNN at 7
Match: GNN at 7
Match: GNN at 7
...
It seems my Perl's tr/// clears pos for all strings. Workaround:
use strict;
use warnings;
my %seq;
my $k = 0;
$seq{$k} = 'xxxxxxxGNNNxxxxxxxNNNGYGYxxxxxxxGYGYNNNxxxxxxxNNNGNNNxxxxxxx';
# break up three character repeats, inserting spaces
while ($seq{$k} =~ s/([QGYN])\1\1/$1$1 $1$1/g) { }
while ($seq{$k} =~ m/([QGYN]{3,6})/g) {
my $saved_pos = pos($seq{$k});
printf("Match: %s at %d\n",
$1,
pos($seq{$k}) - length($1)-2*(substr($seq{$k}, 0, pos($seq{$k})) =~ tr/ / /),
);
pos($seq{$k}) = $saved_pos;
}
Finally, a solution that works!