PerlMonks  

Re: Perl regular expression for amino acid sequence

by BrowserUk (Patriarch)
on Dec 01, 2004 at 20:39 UTC ( [id://411559]=note )


in reply to Perl regular expression for amino acid sequence

#! perl -slw
use strict;

my $s = 'XXQQGGYYNNQGYNNNNQNGGNGGNGGGQQQNNN';

print $s;
print ' ' x( pos( $s ) - length( $1) ), $1 while $s =~ m[
    (                   ## Capture to $1
        (?:             ## A group
            ([QGYN])    ## of these characters
            (?!\2{2})   ## repeated no more than twice in succession
        ){3,6}          ## 3 to 6 characters in length
        ?               ## Remove for greedy matching.
    )
]xg;

## Condensed and greedy
print $s;
print ' ' x( pos( $s ) - length( $1) ), $1
    while $s =~ m[( (?: ([QGYN]) (?!\2{2}) ){3,6} ) ]xg;

__END__
[20:37:58.32] P:\test>temp
XXQQGGYYNNQGYNNNNQNGGNGGNGGGQQQNNN
  QQG
     GYY
        NNQ
               NNQ
                  NGG
                     NGG
XXQQGGYYNNQGYNNNNQNGGNGGNGGGQQQNNN
  QQGGYY
        NNQGY
               NNQNGG
                     NGGN
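
To check the matches without depending on how the padded output lines up, the greedy pattern can be wrapped in a small sub that reports each match with its start offset. This is only a sketch: `find_runs` is a hypothetical name that does not appear in the post, and it reads the offset from `$-[1]` rather than using the `pos`/`length` arithmetic above.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of the original post): collect every greedy
# match of the condensed pattern as an [offset, text] pair.
sub find_runs {
    my ($seq) = @_;
    my @found;
    while ( $seq =~ m[( (?: ([QGYN]) (?!\2{2}) ){3,6} )]xg ) {
        push @found, [ $-[1], $1 ];    # $-[1] is the start offset of group 1
    }
    return @found;
}

# Print one "offset text" line per match for the sample string from the post.
printf "%2d %s\n", @$_ for find_runs('XXQQGGYYNNQGYNNNNQNGGNGGNGGGQQQNNN');
```

For the sample string this should report offsets 2, 8, 15 and 21, which are the same four matches as the greedy output shown in the post.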

Examine what is said, not who speaks.
"But you should never overestimate the ingenuity of the sceptics to come up with a counter-argument." -Myles Allen
"Think for yourself!" - Abigail        "Time is a poor substitute for thought"--theorbtwo         "Efficiency is intelligent laziness." -David Dunham
"Memory, processor, disk in that order on the hardware side. Algorithm, algorithm, algorithm on the code side." - tachyon

Replies are listed 'Best First'.
Re^2: Perl regular expression for amino acid sequence
by ikegami (Patriarch) on Dec 01, 2004 at 21:04 UTC

    Your code returns

    xxxxxxNNNGYGYxxxxxxxGYGYNNNxxxxxxxNNNGNNNxxxxxxx
           NNG
              YGY
                        GYG
                                       NNG
    xxxxxxNNNGYGYxxxxxxxGYGYNNNxxxxxxxNNNGNNNxxxxxxx
           NNGYGY
                        GYGY
                                       NNG

    whereas I would have expected

    xxxxxxNNNGYGYxxxxxxxGYGYNNNxxxxxxxNNNGNNNxxxxxxx
           NNGYGY
                        GYGYNN
                                       NNGNN

      I didn't read the question that way, but now that you've pointed it out, yours could be, and probably is, the more correct interpretation.

      If only the regex engine didn't insist that any backreference to an earlier capture inside a negative look-behind assertion *must* be treated as variable length (and therefore disallowed), even when the referenced parentheses can only, and must, capture exactly one char, it would be easy to fix this to meet your interpretation of the problem. Alas it does, so it isn't :)

      I cannot see a fix at the moment.
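
The restriction described here can be probed on whatever perl is at hand. A minimal sketch, assuming a hypothetical helper `compiles` that is not from this thread: it builds each pattern from a string at run time, so a rejected pattern raises a catchable error instead of stopping the script at compile time. The exact error text, and whether a given perl version rejects the first pattern at all, may vary, so no output is claimed.

```perl
use strict;
use warnings;

# Hypothetical probe (not from the thread): report whether this perl accepts
# a pattern, returning the engine's own error message when it does not.
sub compiles {
    my ($pattern) = @_;
    my $re = eval { qr/$pattern/ };
    return 'accepted' if defined $re;
    ( my $why = $@ ) =~ s/\s+\z//;    # trim the trailing newline from the error
    return "rejected: $why";
}

# A backreference inside a negative look-behind, the construct discussed above.
print compiles('([QGYN])(?<!\1\1\1)'), "\n";

# A fixed-width look-behind for comparison.
print compiles('([QGYN])(?<!NNN)'), "\n";
```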


        There's a neat trick to get around the lookbehind-thinks-backreferences-are-variable-length problem: embed a lookahead in your lookbehind. It doesn't mind if you use backreferences inside the lookahead.

        For the problem at hand, it would go like this:

        /([QGYN]{2}                 # First two characters of the desired class
          (?:                       # Followed by the complex expression...
                                    # Look for a trio starting two characters back
            (?<=(?!(.)\2\2)..)
            [QGYN]                  # Then take another of the desired class
          ){1,4}                    # ...1 to 4 times
         )/gx

        This was a trick someone posted some time ago, but it was over my head at the time. Now I get it.

        Caution: Contents may have been coded under pressure.
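
The look-ahead-inside-look-behind pattern from the last reply can be tried against ikegami's test string with a short script. This is a sketch under assumptions: `find_runs_strict` is a hypothetical name, the offsets are read from `$-[1]`, and it presumes the perl in use accepts a look-ahead nested in a fixed-width look-behind, as the reply describes.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): apply the pattern from the last
# reply and return each match as an [offset, text] pair.
sub find_runs_strict {
    my ($seq) = @_;
    my @found;
    while (
        $seq =~ /(
            [QGYN]{2}                # first two characters of the class
            (?:
                (?<=(?!(.)\2\2)..)   # the next character must not complete a run of three
                [QGYN]               # then take another character of the class
            ){1,4}                   # one to four more, for 3 to 6 in total
        )/gx
      )
    {
        push @found, [ $-[1], $1 ];
    }
    return @found;
}

printf "%2d %s\n", @$_
    for find_runs_strict('xxxxxxNNNGYGYxxxxxxxGYGYNNNxxxxxxxNNNGNNNxxxxxxx');
```

On a perl that accepts the pattern, the matches should be NNGYGY, GYGYNN and NNGNN at offsets 7, 20 and 35, which is the output ikegami expected earlier in the thread.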
