Pathologically Eclectic Rubbish Lister PerlMonks

Re^3: Perl regular expression for amino acid sequence

by BrowserUk (Pope)
 on Dec 01, 2004 at 21:33 UTC

I didn't read the question that way, but now that you've pointed it out, yours could be, and probably is, the more correct interpretation.

If only the regex engine didn't insist that any reference to a previous capture inside a negative look-behind assertion *must* be variable length (and therefore disallowed), even when the brackets referenced can only, and must, capture exactly one char, it would be easy to fix this to meet your interpretation of the problem. Alas it does, so it isn't :)

I cannot see a fix at the moment.
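
For anyone who wants to see the restriction first-hand, here is a minimal sketch. The pattern is made up for illustration and is not one from this thread. It is built from a string so that it compiles at run time, where `eval` can trap the error. The exact wording of the message varies between Perl versions, so it is printed rather than assumed.

```perl
use strict;
use warnings;

# Illustrative pattern (an assumption, not from the thread):
# "two class characters, the second differing from the first",
# written with a backreference inside a negative look-behind.
my $pattern = '([QGYN])[QGYN](?<!\1)';

# qr// on an interpolated string compiles at run time,
# so a compile failure can be caught by eval.
my $re  = eval { qr/$pattern/ };
my $err = $@;

print defined $re ? "compiled\n" : "rejected: $err";
```

The engine cannot know that `\1` is exactly one character long, even though the brackets it refers to can only ever capture one.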

Examine what is said, not who speaks.
"But you should never overestimate the ingenuity of the sceptics to come up with a counter-argument." -Myles Allen
"Think for yourself!" - Abigail        "Time is a poor substitute for thought"--theorbtwo         "Efficiency is intelligent laziness." -David Dunham
"Memory, processor, disk in that order on the hardware side. Algorithm, algorithm, algorithm on the code side." - tachyon

Lookbehind and backreferences
by Roy Johnson (Monsignor) on Dec 02, 2004 at 02:03 UTC
There's a neat trick to get around the lookbehind-thinks-backreferences-are-variable-length problem: embed a lookahead in your lookbehind. It doesn't mind if you use backreferences inside the lookahead.

For the problem at hand, it would go like this:

```
/([QGYN]{2}      # First two characters of the desired class
  (?:            # Followed by the complex expression...
    # Look for a trio starting two characters back
    (?<=(?!(.)\2\2)..)
    [QGYN]       # Then take another of the desired class
  ){1,4}         # ...1 to 4 times
 )/gx
```
This was a trick someone posted some time ago, but it was over my head at the time. Now I get it.
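
To see the mechanism in isolation, here is a minimal sketch on a toy problem that is not from this thread: match any character that differs from the one immediately before it. The look-behind steps back one character, and the look-ahead embedded in it is then free to use the backreference. The variable names and the sample string are only for illustration.

```perl
use strict;
use warnings;

# Look-behind of fixed width 1, containing a negative look-ahead.
# The look-ahead starts one character back and fails if that
# character is immediately repeated, i.e. if the character we are
# about to match equals its predecessor.
my $differs = qr/(?<=(?!(.)\1).)(.)/;

my $text = 'aabbc';
my @hits;
while ( $text =~ /$differs/g ) {
    push @hits, $2;    # $2 is the character actually matched
}
print "@hits\n";       # prints "b c"
```

The first `a` has no predecessor, so the look-behind cannot step back and it is skipped as well.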

Caution: Contents may have been coded under pressure.

That is neat. And very good to know. Thanks.

The need for backrefs in a look-behind doesn't crop up that often, but when it does, I've found it almost impossible to work around--till now.

That one definitely goes into my bag of "techniques worth knowing"!

Roy

Thanks for your input (and everybody else's too). I see that you've given two slightly different solutions; am I right in assuming this one is THE solution?

Since my understanding of Perl regexes was limited to my initial pattern, I'm not sure I understand some of the conversation that has been going on. However, I realised that the length of the pattern found is a big topic, and I hadn't thought about that.

Truly, the longer the pattern, the more significant it is. However, I am looking for repeats of patterns within a sequence, and biologically, repeats don't have to be identical, so YYGNG, to me, is a repeat of YYGNN. But because variations could include other residues (it's almost the entire alphabet), it's also important that I get both short and long matches.

I guess what I'm trying to ask is: does your solution try to make the match as long as possible?

Thanks
Sam

ps: if anyone liked this regex challenge, here's another:

I'd like to find /[QYGN]{4,6}/ under the same conditions; however, the match may contain one residue of ANY letter.

This solution is functionally identical to the one I called a pure regex solution that works, so it's purely a matter of taste which you consider "THE solution".

In both cases, they do try to make the match as long (up to six chars) as possible, though given GYNNNGYYY, you would get GYNN and NGYY rather than GYN and NNGYY. Earlier matches take all they can.
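
A quick sketch to check that behaviour, using the regex from my earlier post collapsed onto one line. Note that a plain list-context `m//g` would also return the (always undefined) second capture for each match, so the loop collects only `$1`. The variable names are just for illustration.

```perl
use strict;
use warnings;

# Same pattern as posted above, without the /x layout.
# \2 is an absolute group number, so this only works while the
# pattern is used on its own, not embedded in a larger regex.
my $run = qr/([QGYN]{2}(?:(?<=(?!(.)\2\2)..)[QGYN]){1,4})/;

my $seq = 'GYNNNGYYY';
my @found;
while ( $seq =~ /$run/g ) {
    push @found, $1;
}
print "@found\n";    # prints "GYNN NGYY"
```

The first match stops after GYNN because taking the next N would complete NNN; the second then starts at that N and stops before the third Y.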

Matching residues makes it very tricky. I will have to ponder that. Meanwhile, you might find it useful to find all your non-residue matches, and then use String::Approx to find copies of those with residues.
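
String::Approx is a CPAN module rather than part of core Perl. As a dependency-free sketch of the same idea, restricted to substitutions only (which is an assumption about what you need), the following counts mismatches between a motif and every same-length window of the sequence. The helper names `mismatches` and `near_copies` and the sample sequence are made up for illustration; the YYGNN/YYGNG pair is the one from your post.

```perl
use strict;
use warnings;

# Hypothetical helper: number of positions at which two
# equal-length strings differ. XOR yields "\0" wherever the
# bytes agree, and tr with /c counts everything else.
sub mismatches {
    my ( $x, $y ) = @_;
    return ( $x ^ $y ) =~ tr/\0//c;
}

# Hypothetical helper: every window of $seq that differs from
# $motif in at most $max positions, as [offset, window] pairs.
sub near_copies {
    my ( $seq, $motif, $max ) = @_;
    my $len = length $motif;
    my @hits;
    for my $pos ( 0 .. length($seq) - $len ) {
        my $window = substr $seq, $pos, $len;
        push @hits, [ $pos, $window ]
            if mismatches( $window, $motif ) <= $max;
    }
    return @hits;
}

# Made-up sequence containing YYGNN and the near copy YYGNG.
my @hits = near_copies( 'AAYYGNNKKYYGNGAA', 'YYGNN', 1 );
print "$_->[0]: $_->[1]\n" for @hits;
```

This does not handle insertions or deletions, which String::Approx can; it only covers the one-changed-residue case.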
