Perl regular expression for amino acid sequence

seaver has asked for the wisdom of the Perl Monks concerning the following question:

Dear all,

I have this very simple pattern:

/[QGYN]{3,6}/

which I run on some 100 yeast protein sequences. The pattern does its job.

My problem is that I want to make sure I don't get any repeats of more than 2 letters in that sequence. Meaning I don't mind seeing 'NN', but I do mind seeing 'NNN'.

What's the best way of doing this?

Thanks
Sam


Replies are listed 'Best First'.
Re: Perl regular expression for amino acid sequence by Roy Johnson (Monsignor) on Dec 01, 2004 at 20:06 UTC
This will come close, but will fail if the match is followed by extra repetitions.

    /(?:(?!(.)\1\1)[QGYN]){3,6}/;

You might have to consider each character separately, which leads to a long ugly string of alternations. The first char matches your character class. The second is either not a repeat, or is a repeat followed by not a repeat. The third is either not a repeat of the second, or a repeat followed by not a repeat. After that, the pattern is repeated for the 4th and 5th characters, but they're all optional and nested (so if you don't have the 4th char, you don't look for the 5th). The 6th char doesn't need to check for repetitions, because it was checked by the pattern for the 5th char.

    while ($seq{$k} =~ /(([QGYN])
                         ((?!\2)[QGYN]|\2(?!\2))
                         ((?!\3)[QGYN]|\3(?!\3))
                         (?:((?!\4)[QGYN]|\4(?!\4))
                         (?:((?!\5)[QGYN]|\5(?!\5))
                         [QGYN]?)?)?)
                       /xg) {
        print "\n$k";
        print $1." begins at position ", (pos($seq{$k})-length($s)) , "\n";
    }

Update: adjusted to fit OP's code snippet.

Update2: As Ikegami noted (and I noted in responding to a different post), this solution has the problem of looking too far ahead. It won't take the first two characters out of a trio. A working regex-only solution is posted as a reply to this post.

Caution: Contents may have been coded under pressure.
Re^2: Perl regular expression for amino acid sequence by Roy Johnson (Monsignor) on Dec 01, 2004 at 21:45 UTC
Here's a pure regex solution that works:

    use strict;
    use warnings;

    while(<DATA>) {
        print "$_---\n";
        my $m;
        while (/([QGYN]{2} # First two characters of the desired class
                 (?:       # Followed by the complex expression...
                   # Lookback at the previous two chars
                   (?<=(.)(.))
                   # Check that the next char differs from at least one of them
                   (?:(?!\2)|(?!\3))
                   [QGYN]  # Then take another of the desired class
                 ){1,4}    # ...1 to 4 times
                )/gx) {
            $m = $1;
            printf "---> $m starting at %d\n", pos($_)-length($m);
        }
        print "=====\n";
    }
    __DATA__
    QYGNGNG
    GGGGGNYGNQYNNNQGYQ
    QGYNNN
    xxxxxxxGNNNxxxxxxxNNNGYGYxxxxxxxGYGYNNNxxxxxxxNNNGNNNxxxxxxx
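For anyone who wants to reuse the lookbehind trick outside a `__DATA__` loop, here is a minimal sketch that wraps the same regex in a sub and returns each hit with its start offset. The sub name `find_runs` and the use of `$-[1]` for the offset are illustrative choices, not part of the original post.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): returns [match, start] pairs for
# every run of 3 to 6 residues from [QGYN] with no letter three times in a row.
sub find_runs {
    my ($seq) = @_;
    my @hits;
    while ($seq =~ /([QGYN]{2}              # first two residues
                     (?:
                       (?<=(.)(.))          # capture the previous two characters
                       (?:(?!\2)|(?!\3))    # next one must differ from at least one
                       [QGYN]
                     ){1,4}                 # 1 to 4 more residues
                    )/gx) {
        push @hits, [ $1, $-[1] ];          # offset where group 1 started
    }
    return @hits;
}

for my $hit (find_runs('xxxxxxxGNNNxxxxxxxNNNGYGYxxxxxxxGYGYNNNxxxxxxxNNNGNNNxxxxxxx')) {
    printf "%s begins at position %d\n", @$hit;
}
```

On ikegami's test string this should report the four runs he lists as expected elsewhere in the thread (GNN at 7, NNGYGY at 19, GYGYNN at 32, NNGNN at 47).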
Re^3: Perl regular expression for amino acid sequence by hv (Prior) on Dec 02, 2004 at 13:34 UTC
This is a very nice solution; I haven't seen that trick before.

Hugo
Re^2: Perl regular expression for amino acid sequence by dragonchild (Archbishop) on Dec 01, 2004 at 20:47 UTC
Build that programmatically.

    my $regex = '(([QGYN])';
    # Single quotes do not interpolate, so the group number is concatenated in.
    $regex .= '((?!\\' . $_ . ')[QGYN]|\\' . $_ . '(?!\\' . $_ . '))' for 2 .. 3;
    $regex .= '(?:((?!\\' . $_ . ')[QGYN]|\\' . $_ . '(?!\\' . $_ . '))' for 4 .. 5;
    $regex .= '[QGYN]?)?)?)';
    $regex = qr/$regex/;
    while ($seq{$k} =~ /$regex/g) {
        print "\n$k";
        print $1." begins at position ", (pos($seq{$k})-length($s)) , "\n";
    }

Now, I have no idea what all that does, but it's easily broken apart. :-)

Being right, does not endow the right to be rude; politeness costs nothing. Being unknowing, is not the same as being stupid. Expressing a contrary opinion, whether to the individual or the group, is more often a sign of deeper thought than of cantankerous belligerence. Do not mistake your goals as the only goals; your opinion as the only opinion; your confidence as correctness. Saying you know better is not the same as explaining you know better.
Re^2: Perl regular expression for amino acid sequence by ikegami (Patriarch) on Dec 01, 2004 at 21:11 UTC
Input `'xxxxxxxGNNNxxxxxxxNNNGYGYxxxxxxxGYGYNNNxxxxxxxNNNGNNNxxxxxxx'` gives:

    NNGYGY begins at position 19
    GYGYNN begins at position 32
    NNGN begins at position 47

rather than

    GNN begins at position 7      <---
    NNGYGY begins at position 19
    GYGYNN begins at position 32
    NNGNN begins at position 47   <---
Re^2: Perl regular expression for amino acid sequence by seaver (Pilgrim) on Dec 01, 2004 at 20:27 UTC
Thank you! It sure did work, and it was surely surprising that it was so ugly. :-D

Sam
Re: Perl regular expression for amino acid sequence by dragonchild (Archbishop) on Dec 01, 2004 at 19:49 UTC
    /[QGYN]{3,6}/ && !/(.)(?=\1\1)/

Or something like that. Instead of making it one regex, what's wrong with making it two regexes?
Re^2: Perl regular expression for amino acid sequence by seaver (Pilgrim) on Dec 01, 2004 at 20:01 UTC
Ah, I should have explained more. I need to know what the found pattern was, and I need to know if it's repeated, which is why two regexes won't work, I'm afraid. Here's my full code that includes your addition:

    while ($seq{$k} =~ /([QGYN]{3,6})/g) {
        print "\n$k";
        print $1." begins at position ", (pos($seq{$k})-length($s)) , "\n";
    }

Thanks
Sam
Re^3: Perl regular expression for amino acid sequence by TedYoung (Deacon) on Dec 01, 2004 at 20:12 UTC
Hi,

Actually, I think you could use two regexes here:

    while ($seq{$k} =~ /([QGYN]{3,6})/g) {
        my $seq = $1;
        next if $seq =~ /(.)\1\1/;
        print "\n$k";
        print "$seq begins at position ", (pos($seq{$k})-length($s)) , "\n";
    }

If this works for you, we could even optimize and consolidate this code a bit. I don't know where $s comes from, but I assume its length isn't changing.

    my $length = length $s; # Pull this out of the loop for efficiency
    my $sequence = $seq{$k};
    while ($sequence =~ /([QGYN]{3,6})/g) {
        my $seq = $1;
        my $pos = $-[0] - $length; # @- holds the positions of the last match
        next if $seq =~ /(.)\1\1/;
        print "\n$k $seq begins at position $pos\n";
    }

update: that was supposed to be print, not printf

Note that this is untested...

Ted Young

`($$<<$$=>$$<=>$$<=$$>>$$) always returns 1. :-)`
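As a small illustration of the `@-` remark: `$-[0]` is the offset where the last successful match started and `$+[0]` is the offset just past its end, so the start position can be read directly without subtracting a length. The sketch below is only an assumption about how one might use them; the sub name `hits_with_offsets` and the sample string are made up for the example. The offsets are copied before the second regex runs, because a successful inner match would overwrite them.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): coarse match first, then drop any
# hit that contains the same letter three times in a row.
sub hits_with_offsets {
    my ($sequence) = @_;
    my @hits;
    while ($sequence =~ /([QGYN]{3,6})/g) {
        # Copy the capture and offsets now; the next regex may clobber them.
        my ($hit, $start, $end) = ($1, $-[0], $+[0]);
        next if $hit =~ /(.)\1\1/;
        push @hits, [ $hit, $start, $end ];
    }
    return @hits;
}

for my $h (hits_with_offsets('xxxxxxxGYGYNNxxxxxxx')) {   # made-up sample data
    printf "%s begins at position %d and ends before %d\n", @$h;
}
```

Note that this two-regex filter still throws away a whole hit such as GNNN rather than trimming it to GNN, which is the objection raised elsewhere in the thread.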
Re^4: Perl regular expression for amino acid sequence by Roy Johnson (Monsignor) on Dec 01, 2004 at 20:17 UTC
Re^3: Perl regular expression for amino acid sequence by dws (Chancellor) on Dec 01, 2004 at 21:13 UTC
"I need to know what the found pattern was and I need to know if it's repeated which is why two regexp wont work I'm afraid."

Two regexes will work just fine. Use the first to do coarse filtering, and the second to do the fine filtering.

    while ($seq{$k} =~ /([QGYN]{3,6})/g) {
        next if $1 =~ m/QQQ|GGG|YYY|NNN/;
        print "\n$k";
        print $1." begins at position ", (pos($seq{$k})-length($s)) , "\n";
    }

This has the benefit of being blindingly obvious about what you're doing.

Oops: ikegami is correct. This is blindingly wrong.
Re^4: Perl regular expression for amino acid sequence by ikegami (Patriarch) on Dec 01, 2004 at 21:25 UTC
Re^3: Perl regular expression for amino acid sequence by !1 (Hermit) on Dec 01, 2004 at 20:37 UTC
This solution is actually fairly wrong, since it first attempts to take from the front instead of trying to shorten the match. Of course, this assumes QGNNNG should be considered a series of two valid amino acid sequences, namely QGN and NNG.

    my $cur;
    while ($seq{$k} =~ /([QGYN]{3,6})/g) {
        $cur = $1;
        pos($seq{$k}) -= length($cur) - 1 and next if $cur =~ /(.)\1\1/;
        print "\n$k";
        print $cur." begins at position ", (pos($seq{$k})-length($s)) , "\n";
    }
Re^4: Perl regular expression for amino acid sequence by Roy Johnson (Monsignor) on Dec 01, 2004 at 20:53 UTC
Re: Perl regular expression for amino acid sequence by BrowserUk (Patriarch) on Dec 01, 2004 at 20:39 UTC
    #! perl -slw
    use strict;

    my $s = 'XXQQGGYYNNQGYNNNNQNGGNGGNGGGQQQNNN';

    print $s;
    print ' ' x( pos( $s ) - length( $1) ), $1 while $s =~ m[
        (                   ## Capture to $1
            (?:             ## A group
                ([QGYN])    ## of these characters
                (?!\2{2})   ## repeated no more than twice in succession
            ){3,6}          ## 3 to 6 characters in length
            ?               ## Remove for greedy matching.
        )
    ]xg;

    ## Condensed and greedy
    print $s;
    print ' ' x( pos( $s ) - length( $1) ), $1 while $s =~ m[( (?: ([QGYN]) (?!\2{2}) ){3,6} ) ]xg;

    __END__
    [20:37:58.32] P:\test>temp
    XXQQGGYYNNQGYNNNNQNGGNGGNGGGQQQNNN
      QQG
         GYY
            NNQ
                   NNQ
                      NGG
                         NGG
    XXQQGGYYNNQGYNNNNQNGGNGGNGGGQQQNNN
      QQGGYY
            NNQGY
                   NNQNGG
                         NGGN

Examine what is said, not who speaks.
"But you should never overestimate the ingenuity of the sceptics to come up with a counter-argument." -Myles Allen
"Think for yourself!" - Abigail
"Time is a poor substitute for thought"--theorbtwo
"Efficiency is intelligent laziness." -David Dunham
"Memory, processor, disk in that order on the hardware side. Algorithm, algorithm, algorithm on the code side." - tachyon
Re^2: Perl regular expression for amino acid sequence by ikegami (Patriarch) on Dec 01, 2004 at 21:04 UTC
Your code returns

    xxxxxxNNNGYGYxxxxxxxGYGYNNNxxxxxxxNNNGNNNxxxxxxx
           NNG
              YGY
                        GYG
                                       NNG
    xxxxxxNNNGYGYxxxxxxxGYGYNNNxxxxxxxNNNGNNNxxxxxxx
           NNGYGY
                        GYGY
                                       NNG

whereas I would have expected

    xxxxxxNNNGYGYxxxxxxxGYGYNNNxxxxxxxNNNGNNNxxxxxxx
           NNGYGY
                        GYGYNN
                                       NNGNN
Re^3: Perl regular expression for amino acid sequence by BrowserUk (Patriarch) on Dec 01, 2004 at 21:33 UTC
I didn't read the question that way, but now you've pointed it out, yours could be, and probably is, the more correct interpretation.

If the regex engine didn't insist that any reference to a previous capture in a negative look-behind assertion is variable length (and therefore disallowed), even when the brackets referenced can only, and must, capture exactly one char, then it would be easy to fix this to meet your interpretation of the problem.

Alas it does, so there isn't :) I cannot see a fix at the moment.
Lookbehind and backreferences by Roy Johnson (Monsignor) on Dec 02, 2004 at 02:03 UTC
Re: Lookbehind and backreferences by BrowserUk (Patriarch) on Dec 02, 2004 at 02:19 UTC
Re: Lookbehind and backreferences by seaver (Pilgrim) on Dec 02, 2004 at 15:59 UTC
Re: Perl regular expression for amino acid sequence by Thelonius (Priest) on Dec 01, 2004 at 20:37 UTC
Here's one way, which needs a slightly convoluted way of figuring out the original position:

    # break up three character repeats, inserting spaces
    while ($seq{$k} =~ s/([QGYN])\1\1/$1$1 $1$1/g) { }
    while ($seq{$k} =~ m/([QGYN]{3,6})/g) {
        print "Match: $1 at ",
            pos($seq{$k}) - length($1)-2*(substr($seq{$k}, 0, pos($seq{$k})) =~ tr/ / /),
            "\n";
    }

If you already have spaces in your sequences, you'd have to use some other character.

Updated: Changed 5 to 6. I thought the original had a "5", but it was just the tiny fonts on my monitor.
Re^2: Perl regular expression for amino acid sequence by ikegami (Patriarch) on Dec 01, 2004 at 21:21 UTC
    >perl script.pl
    Match: GNN at 7
    Match: GNN at 7
    Match: GNN at 7
    Match: GNN at 7
    Match: GNN at 7
    Match: GNN at 7
    Match: GNN at 7
    Match: GNN at 7
    ...

It seems my Perl's tr/// clears pos for all strings. Workaround:

    use strict;
    use warnings;

    my %seq;
    my $k = 0;
    $seq{$k} = 'xxxxxxxGNNNxxxxxxxNNNGYGYxxxxxxxGYGYNNNxxxxxxxNNNGNNNxxxxxxx';

    # break up three character repeats, inserting spaces
    while ($seq{$k} =~ s/([QGYN])\1\1/$1$1 $1$1/g) { }
    while ($seq{$k} =~ m/([QGYN]{3,5})/g) {
        my $saved_pos = pos($seq{$k});
        printf("Match: %s at %d\n",
            $1,
            pos($seq{$k}) - length($1)-2*(substr($seq{$k}, 0, pos($seq{$k})) =~ tr/ / /),
        );
        pos($seq{$k}) = $saved_pos;
    }

Finally, a solution that works!
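Another way to sidestep the pos problem, sketched here under the assumption that the loop misbehaves because `tr/ / /` is applied to a `substr` of the string being scanned, is to count the spaces in a separate copy of the prefix. The sub name `scan_with_separator` is hypothetical, and the sketch uses `{3,6}` from the original question. It also works on a copy of the sequence so the caller's data keeps its original form.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): split triples with a space, scan,
# then map each hit back to its offset in the unmodified string.
sub scan_with_separator {
    my ($original) = @_;
    my $work = $original;                          # leave the caller's string alone
    1 while $work =~ s/([QGYN])\1\1/$1$1 $1$1/g;   # break up three-letter repeats
    my @hits;
    while ($work =~ /([QGYN]{3,6})/g) {
        my $hit    = $1;
        my $end    = pos($work);
        my $prefix = substr($work, 0, $end);       # a copy, so pos($work) is untouched
        my $spaces = ($prefix =~ tr/ //);          # count only, nothing is replaced
        # Each inserted space came with one duplicated letter: 2 extra chars.
        push @hits, [ $hit, $end - length($hit) - 2 * $spaces ];
    }
    return @hits;
}

my $test = 'xxxxxxxGNNNxxxxxxxNNNGYGYxxxxxxxGYGYNNNxxxxxxxNNNGNNNxxxxxxx';
printf "Match: %s at %d\n", @$_ for scan_with_separator($test);
```

The sketch inherits the limitations of the approach it copies: the sequences must not already contain spaces, and the offset arithmetic assumes each repeat is split exactly once.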
