in reply to Re^3: Perl regular expression for amino acid sequence
in thread Perl regular expression for amino acid sequence

There's a neat trick to get around the lookbehind-thinks-backreferences-are-variable-length problem: embed a lookahead in your lookbehind. It doesn't mind if you use backreferences inside the lookahead.
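
As a minimal sketch of the trick on its own (the string and the pattern here are made up for illustration and are not from the thread): to match an `x` only when the two characters before it are a doubled letter, the lookbehind steps back a fixed two characters with `..`, and the backreference lives inside a lookahead anchored at that earlier position.

```perl
use strict;
use warnings;

my $text = 'aaxbcx';   # made-up sample: only the first x follows a doubled letter

my @offsets;
# (?<= ... ..) steps back exactly two characters;
# (?=(.)\1) then checks, from that earlier position, for a doubled character.
while ($text =~ /(?<=(?=(.)\1)..)x/g) {
    push @offsets, $-[0];   # offset where this match starts
}

print "@offsets\n";
```

The lookbehind body still consumes exactly two characters, which seems to be what keeps the regex engine satisfied; the lookahead is zero-width, so the backreference inside it does not count toward the length.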

For the problem at hand, it would go like this

/([QGYN]{2}        # First two characters of the desired class
  (?:              # Followed by the complex expression...
    # Look for a trio starting two characters back
    (?<=(?!(.)\2\2)..)
    [QGYN]         # Then take another of the desired class
  ){1,4}           # ...1 to 4 times
 )/gx

This was a trick someone posted some time ago, but it was over my head at the time. Now I get it.

Caution: Contents may have been coded under pressure.

Re: Lookbehind and backreferences
by BrowserUk (Patriarch) on Dec 02, 2004 at 02:19 UTC

    That is neat. And very good to know. Thanks.

    The need for backrefs in a look-behind doesn't crop up that often, but when it does, I've found it almost impossible to work around--till now.

    That one definitely goes into my bag of "techniques worth knowing"!

Re: Lookbehind and backreferences
by seaver (Pilgrim) on Dec 02, 2004 at 15:59 UTC

    Thanks for your input (and everybody else's too). I see that you've given two slightly different solutions; am I right in assuming this one is THE solution?

    Since my understanding of Perl regexes was limited to my initial pattern, I'm not sure I understand some of the conversation that has been going on. However, I realised that the length of the pattern found is a big topic, and I hadn't thought about that.

    Truly, the longer the pattern, the more significant it is. However, I am looking for repeats of patterns within a sequence, and biologically, repeats don't have to be identical, so YYGNG, to me, is a repeat of YYGNN. But because variations could include other residues (it's almost the entire alphabet), it's also important that I get both short and long matches.

    I guess what I'm trying to ask is: does your solution try to make the match as long as possible?


    ps: if anyone liked this challenge of regex, here's another challenge:

    I'd want to find /[QYGN]{4,6}/ under the same conditions, except that the match may contain one residue of ANY letter.

      This solution is functionally identical to the one I called a pure regex solution that works, so it's purely a matter of taste which you consider "THE solution".

      In both cases, they do try to make the match as long (up to six chars) as possible, though given GYNNNGYYY, you would get GYNN and NGYY rather than GYN and NNGYY. Earlier matches take all they can.
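
To see the "earlier matches take all they can" behaviour, here is a small sketch that runs the pattern from the top of this node (written compactly, without the /x comments) over the GYNNNGYYY example and collects each match in order. The variable names are mine.

```perl
use strict;
use warnings;

# Same pattern as in the post above, without the /x layout.
my $re = qr/([QGYN]{2}(?:(?<=(?!(.)\2\2)..)[QGYN]){1,4})/;

my $seq = 'GYNNNGYYY';
my @found;
while ($seq =~ /$re/g) {
    push @found, $1;   # each match is taken as far as it will go before the next starts
}

print "@found\n";   # prints: GYNN NGYY
```

The first match stops at GYNN because one more N would complete NNN; the second starts where the first ended and stops at NGYY because one more Y would complete YYY.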

      Allowing an arbitrary residue in the match makes it very tricky. I will have to ponder that. Meanwhile, you might find it useful to find all your matches without the extra residue, and then use String::Approx to find copies of those that contain one.
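
A rough sketch of that two-pass idea, under stated assumptions: it does not use String::Approx at all, but stands in a plain one-substitution comparison (Hamming distance) written in core Perl. The helper `near_copies` and the sample sequence are hypothetical, made up for illustration, and only same-length copies with substituted characters are considered (no insertions or deletions).

```perl
use strict;
use warnings;

# Exact-match pattern from the top of this node, without the /x layout.
my $exact = qr/([QGYN]{2}(?:(?<=(?!(.)\2\2)..)[QGYN]){1,4})/;

# Hypothetical helper: every window of $seq that differs from $motif
# in at most $max positions. Returns a list of [offset, window] pairs.
sub near_copies {
    my ($seq, $motif, $max) = @_;
    my $len = length $motif;
    my @hits;
    for my $i (0 .. length($seq) - $len) {
        my $window = substr $seq, $i, $len;
        # Count the positions where the window and the motif disagree.
        my $diff = grep { substr($window, $_, 1) ne substr($motif, $_, 1) }
                   0 .. $len - 1;
        push @hits, [$i, $window] if $diff <= $max;
    }
    return @hits;
}

my $seq = 'AAYYGNNAAYYGNGAAYYANN';   # made-up sequence

my %seen;
while ($seq =~ /$exact/g) {          # pass 1: exact matches
    my $motif = $1;
    next if $seen{$motif}++;
    for my $hit (near_copies($seq, $motif, 1)) {   # pass 2: near copies
        printf "%s ~ %s at offset %d\n", $motif, $hit->[1], $hit->[0];
    }
}
```

On this sample the exact pass finds YYGNN and YYGNG, and the second pass also picks up YYANN at offset 16 as a one-residue variant of YYGNN. Note that the second pass does not re-check the no-three-in-a-row condition on the near copies; that would need an extra filter.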
