http://qs321.pair.com?node_id=411539

seaver has asked for the wisdom of the Perl Monks concerning the following question:

Dear all,

I have this very simple pattern:

/[QGYN]{3,6}/

which I run on some 100 yeast protein sequences. The pattern does its job.

My problem is that I want to make sure I don't get any repeats of more than 2 letters in that sequence. Meaning I don't mind seeing 'NN', but I do mind seeing 'NNN'.

What's the best way of doing this?

Thanks
Sam
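
To make the problem concrete, here is a minimal sketch (the test string is made up for illustration and is not from the original post) showing that the plain character class happily accepts a triple:

```perl
use strict;
use warnings;

# Hypothetical test string: one G followed by three N's.
my $seq  = 'xxGNNNxx';
my @hits = $seq =~ /([QGYN]{3,6})/g;   # list context + /g returns every $1

print "@hits\n";   # prints "GNNN", which contains the unwanted 'NNN'
```

The answer the question seems to want here would be 'GNN', though the thread below shows that this is open to interpretation.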

Replies are listed 'Best First'.
Re: Perl regular expression for amino acid sequence
by Roy Johnson (Monsignor) on Dec 01, 2004 at 20:06 UTC
    This will come close, but will fail if the match is followed by extra repetitions.
    /(?:(?!(.)\1\1)[QGYN]){3,6}/;
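
A small sketch of that failure mode, assuming a made-up test string: the lookahead rejects the first N of the trio outright, so the run 'GNN' is never taken.

```perl
use strict;
use warnings;

# Hypothetical test string; the desired answer would be 'GNN'.
my $seq = 'xxGNNNxx';

# Same pattern as above, wrapped in a capture group, so the inner
# backreference becomes \2 instead of \1.
my ($hit) = $seq =~ /((?:(?!(.)\2\2)[QGYN]){3,6})/;

print defined $hit ? "matched $hit\n" : "no match\n";   # prints "no match"
```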
    You might have to consider each character separately, which leads to a long ugly string of alternations. The first char matches your character class. The second is either not a repeat, or is a repeat followed by not a repeat. The third is either not a repeat of the second, or a repeat followed by not a repeat.

    After that, the pattern is repeated for the 4th and 5th characters, but they're all optional and nested (so if you don't have the 4th char, you don't look for the 5th). The 6th char doesn't need to check for repetitions, because it was checked by the pattern for the 5th char.

    while ($seq{$k} =~ /(([QGYN])
                         ((?!\2)[QGYN]|\2(?!\2))
                         ((?!\3)[QGYN]|\3(?!\3))
                         (?:((?!\4)[QGYN]|\4(?!\4))
                           (?:((?!\5)[QGYN]|\5(?!\5))
                             [QGYN]?)?)?)
                       /xg) {
        print "\n$k";
        print $1." begins at position ", (pos($seq{$k})-length($s)) , "\n";
    }
    Update: adjusted to fit OP's code snippet.
    Update2: As Ikegami noted (and I noted in responding to a different post), this solution has the problem of looking too far ahead. It won't take the first two characters out of a trio. A working regex-only solution is posted as a reply to this post.

    Caution: Contents may have been coded under pressure.
      Here's a pure regex solution that works:
      use strict;
      use warnings;

      while (<DATA>) {
          print "$_---\n";
          my $m;
          while (/([QGYN]{2}          # First two characters of the desired class
                   (?:                # Followed by the complex expression...
                     # Lookback at the previous two chars
                     (?<=(.)(.))
                     # Check that the next char differs from at least one of them
                     (?:(?!\2)|(?!\3))
                     [QGYN]           # Then take another of the desired class
                   ){1,4}             # ...1 to 4 times
                  )/gx) {
              $m = $1;
              printf "---> $m starting at %d\n", pos($_)-length($m);
          }
          print "=====\n";
      }
      __DATA__
      QYGNGNG
      GGGGGNYGNQYNNNQGYQ
      QGYNNN
      xxxxxxxGNNNxxxxxxxNNNGYGYxxxxxxxGYGYNNNxxxxxxxNNNGNNNxxxxxxx
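
For reuse, the same pattern can be wrapped in a small helper. This is only a sketch: the name find_runs is hypothetical, and it reads the start offset from $-[1] rather than computing pos minus length.

```perl
use strict;
use warnings;

# Hypothetical helper: returns [run, start_offset] pairs for one sequence.
sub find_runs {
    my ($seq) = @_;
    my @runs;
    # Two class characters, then 1 to 4 more, each of which must differ
    # from at least one of the two characters immediately before it.
    while ($seq =~ /([QGYN]{2}(?:(?<=(.)(.))(?:(?!\2)|(?!\3))[QGYN]){1,4})/g) {
        push @runs, [ $1, $-[1] ];   # $-[1] is the offset where $1 starts
    }
    return @runs;
}

for my $run (find_runs('xxxxxxxGNNNxxxxxxxNNNGYGYxxxxxxxGYGYNNNxxxxxxxNNNGNNNxxxxxxx')) {
    printf "%s begins at position %d\n", @$run;
}
# Prints:
#   GNN begins at position 7
#   NNGYGY begins at position 19
#   GYGYNN begins at position 32
#   NNGNN begins at position 47
```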


        This is a very nice solution; I haven't seen that trick before.

        Hugo

      Build that programmatically.
      my $regex = '(([QGYN])';
      $regex .= '((?!\\' . $_ . ')[QGYN]|\\' . $_ . '(?!\\' . $_ . '))' for 2 .. 3;
      $regex .= '(?:((?!\\' . $_ . ')[QGYN]|\\' . $_ . '(?!\\' . $_ . '))' for 4 .. 5;
      $regex .= '[QGYN]?)?)?)';
      $regex = qr/$regex/;
      while ($seq{$k} =~ /$regex/g) {
          print "\n$k";
          print $1." begins at position ", (pos($seq{$k})-length($s)) , "\n";
      }

      Now, I have no idea what all that does, but it's easily broken apart. :-)

      Being right, does not endow the right to be rude; politeness costs nothing.
      Being unknowing, is not the same as being stupid.
      Expressing a contrary opinion, whether to the individual or the group, is more often a sign of deeper thought than of cantankerous belligerence.
      Do not mistake your goals as the only goals; your opinion as the only opinion; your confidence as correctness. Saying you know better is not the same as explaining you know better.

      Input 'xxxxxxxGNNNxxxxxxxNNNGYGYxxxxxxxGYGYNNNxxxxxxxNNNGNNNxxxxxxx' gives:
      NNGYGY begins at position 19
      GYGYNN begins at position 32
      NNGN begins at position 47

      rather than

      GNN begins at position 7       <---
      NNGYGY begins at position 19
      GYGYNN begins at position 32
      NNGNN begins at position 47    <---

      Thank you!

      It sure did work, and it was surely surprising that it was so ugly. :-D

      Sam

Re: Perl regular expression for amino acid sequence
by dragonchild (Archbishop) on Dec 01, 2004 at 19:49 UTC
    /[QGYN]{3,6}/ && !/(.)(?=\1\1)/

    Or something like that. Instead of making it one regex, what's wrong with making it two regexes?


      Ah, I should have explained more. I need to know what the found pattern was, and I need to know if it's repeated, which is why two regexps won't work, I'm afraid. Here's my full code that includes your addition:

      while ($seq{$k} =~ /([QGYN]{3,6})/g) {
          print "\n$k";
          print $1." begins at position ", (pos($seq{$k})-length($s)) , "\n";
      }

      Thanks
      Sam

        Hi,

        Actually, I think you could use two regexes here:

        while ($seq{$k} =~ /([QGYN]{3,6})/g) {
            my $seq = $1;
            next if $seq =~ /(.)\1\1/;
            print "\n$k";
            print "$seq begins at position ", (pos($seq{$k})-length($s)) , "\n";
        }

        If this works for you, we could even optimize and consolidate this code a bit. I don't know where $s comes from, but I assume its length doesn't change.

        my $length = length $s;  # Pull this out of the loop for eff.
        my $sequence = $seq{$k};
        while ($sequence =~ /([QGYN]{3,6})/g) {
            my $seq = $1;
            my $pos = $-[0] - $length;  # @- holds the positions on the last match
            next if $seq =~ /(.)\1\1/;
            print "\n$k $seq begins at position $pos\n";
        }

        update: that was supposed to be print, not printf

        Note that this is untested...

        Ted Young

        ($$<<$$=>$$<=>$$<=$$>>$$) always returns 1. :-)

        I need to know what the found pattern was, and I need to know if it's repeated, which is why two regexps won't work, I'm afraid.

        Two regexes will work just fine. Use the first to do coarse filtering, and the second to reject any candidate that contains a triple.

        while ($seq{$k} =~ /([QGYN]{3,6})/g) {
            next if $1 =~ m/QQQ|GGG|YYY|NNN/;
            print "\n$k";
            print $1." begins at position ", (pos($seq{$k})-length($s)) , "\n";
        }
        This has the benefit of being blindingly obvious about what you're doing.

        Oops: ikegami is correct. This is blindingly wrong.

        This solution is actually fairly wrong, since it first attempts to take from the front instead of trying to shorten the match. Of course, that only matters if QGNNNG should be considered a series of two valid amino acid sequences, namely QGN and NNG.

        my $cur;
        while ($seq{$k} =~ /([QGYN]{3,6})/g) {
            $cur = $1;
            pos($seq{$k}) -= length($cur) - 1 and next if $cur =~ /(.)\1\1/;
            print "\n$k";
            print $cur." begins at position ", (pos($seq{$k})-length($s)) , "\n";
        }

Re: Perl regular expression for amino acid sequence
by BrowserUk (Patriarch) on Dec 01, 2004 at 20:39 UTC
    #! perl -slw
    use strict;

    my $s = 'XXQQGGYYNNQGYNNNNQNGGNGGNGGGQQQNNN';

    print $s;
    print ' ' x( pos( $s ) - length( $1) ), $1 while $s =~ m[
        (                   ## Capture to $1
            (?:             ## A group
                ([QGYN])    ## of these characters
                (?!\2{2})   ## repeated no more than twice in succession
            ){3,6}          ## 3 to 6 characters in length
            ?               ## Remove for greedy matching.
        )
    ]xg;

    ## Condensed and greedy
    print $s;
    print ' ' x( pos( $s ) - length( $1) ), $1 while $s =~ m[(
        (?: ([QGYN]) (?!\2{2}) ){3,6}
    ) ]xg;

    __END__
    [20:37:58.32] P:\test>temp
    XXQQGGYYNNQGYNNNNQNGGNGGNGGGQQQNNN
      QQG
         GYY
            NNQ
                   NNQ
                      NGG
                         NGG
    XXQQGGYYNNQGYNNNNQNGGNGGNGGGQQQNNN
      QQGGYY
            NNQGY
                   NNQNGG
                         NGGN

    Examine what is said, not who speaks.
    "But you should never overestimate the ingenuity of the sceptics to come up with a counter-argument." -Myles Allen
    "Think for yourself!" - Abigail        "Time is a poor substitute for thought"--theorbtwo         "Efficiency is intelligent laziness." -David Dunham
    "Memory, processor, disk in that order on the hardware side. Algorithm, algorithm, algorithm on the code side." - tachyon

      Your code returns

      xxxxxxNNNGYGYxxxxxxxGYGYNNNxxxxxxxNNNGNNNxxxxxxx
             NNG
                YGY
                          GYG
                                         NNG
      xxxxxxNNNGYGYxxxxxxxGYGYNNNxxxxxxxNNNGNNNxxxxxxx
             NNGYGY
                          GYGY
                                         NNG

      whereas I would have expected

      xxxxxxNNNGYGYxxxxxxxGYGYNNNxxxxxxxNNNGNNNxxxxxxx
             NNGYGY
                          GYGYNN
                                         NNGNN

        I didn't read the question that way, but now you've pointed it out, yours could be, and probably is the more correct interpretation.

        If only the regex engine did not insist that any reference to a previous capture inside a negative look-behind assertion *must* be variable length (and therefore disallowed), even when the referenced brackets can only, and must, capture exactly one char, it would be easy to fix this to meet your interpretation of the problem. Alas it does, so it isn't. :)

        I cannot see a fix at the moment.


Re: Perl regular expression for amino acid sequence
by Thelonius (Priest) on Dec 01, 2004 at 20:37 UTC
    Here's one way, which needs a slightly convoluted way of figuring out the original position:
    # break up three character repeats, inserting spaces
    while ($seq{$k} =~ s/([QGYN])\1\1/$1$1 $1$1/g) { }
    while ($seq{$k} =~ m/([QGYN]{3,6})/g) {
        print "Match: $1 at ",
            pos($seq{$k}) - length($1)-2*(substr($seq{$k}, 0, pos($seq{$k})) =~ tr/ / /),
            "\n";
    }
    If you already have spaces in your sequences, you'd have to use some other character.

    Updated: Changed 5 to 6. I thought the original had a "5", but it was just the tiny fonts on my monitor.
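
The position arithmetic may be easier to follow in a variant that works on a copy and counts spaces in a separate prefix string. This is a sketch under assumptions, not the code above: the variable names ($work, $prefix, @found) and the use of $-[1] are choices made for illustration. Each split turns three characters into five, so every space before a match shifts it right by two.

```perl
use strict;
use warnings;

my $seq = 'xxxxxxxGNNNxxxxxxxNNNGYGYxxxxxxxGYGYNNNxxxxxxxNNNGNNNxxxxxxx';

# Work on a copy so the original sequence stays untouched.
my $work = $seq;
1 while $work =~ s/([QGYN])\1\1/$1$1 $1$1/g;   # 'NNN' becomes 'NN NN'

my @found;
while ($work =~ /([QGYN]{3,6})/g) {
    my $start  = $-[1];                      # offset of $1 within $work
    my $prefix = substr($work, 0, $start);   # a copy, so pos($work) is unaffected
    my $spaces = ($prefix =~ tr/ //);        # empty replacement list: count only
    push @found, [ $1, $start - 2 * $spaces ];   # undo 2 extra chars per split
}

printf "Match: %s at %d\n", @$_ for @found;
# Prints:
#   Match: GNN at 7
#   Match: NNGYGY at 19
#   Match: GYGYNN at 32
#   Match: NNGNN at 47
```

Counting on a copied prefix keeps tr/// away from the string being iterated, so there is no match position to save and restore.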

      >perl script.pl
      Match: GNN at 7
      Match: GNN at 7
      Match: GNN at 7
      Match: GNN at 7
      Match: GNN at 7
      Match: GNN at 7
      Match: GNN at 7
      Match: GNN at 7
      ...

      It seems my Perl's tr/// clears pos for all strings. Workaround:

      use strict;
      use warnings;

      my %seq;
      my $k = 0;
      $seq{$k} = 'xxxxxxxGNNNxxxxxxxNNNGYGYxxxxxxxGYGYNNNxxxxxxxNNNGNNNxxxxxxx';

      # break up three character repeats, inserting spaces
      while ($seq{$k} =~ s/([QGYN])\1\1/$1$1 $1$1/g) { }
      while ($seq{$k} =~ m/([QGYN]{3,5})/g) {
          my $saved_pos = pos($seq{$k});
          printf("Match: %s at %d\n",
              $1,
              pos($seq{$k}) - length($1)-2*(substr($seq{$k}, 0, pos($seq{$k})) =~ tr/ / /),
          );
          pos($seq{$k}) = $saved_pos;
      }

      Finally, a solution that works!