http://qs321.pair.com?node_id=782858

Using Look-ahead and Look-behind teaches you how to use look-around assertions in regular expressions. This node tells you why it's sometimes not so easy to use them correctly. So if you have had trouble with them, know that you're not alone ;-).

Some time ago aaaone asked an interesting question that made me think, not only about the question itself but also a bit "meta". Phrased in my own words, the question was: how can one construct a regex that matches a regex $end, while ensuring that there is no match of regex $a before it? For example, how can I match text enclosed in <code>...</code> tags that doesn't contain the text foo? In this case we have $a = qr{foo} and $end = qr{</code>}.

I thought about it a bit, and then I had a solution. Or so I thought. While typing my reply I stopped, thought harder, and realized that the problem is much more tricky than it seems.

Here are a few attempts at doing that with negative look-ahead assertions, the approach that our fellow monk suggested in his question.

We could start with a simple regex like this:

m/^(?! $a ) .*? $end/sx

When you think a bit more about it, you'll soon realize that it doesn't work: it only checks that $a doesn't occur at the start of the string. So the look-ahead has to search the rest of the string for places where $a matches:

m/^(?! .*? $a ) .*? $end/sx

That looks much better, but it has a pitfall: it will prevent the regex from matching even if $a only matches after $end has matched:

$_ = '...</code> ... foo';
if (m{^(?! .*? foo ) .*? </code>}sx) {
    print "Match\n";
}
# no match
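To see the two attempts so far side by side, here is a minimal sketch that runs both on the same inputs. The sub names attempt_start_only and attempt_scan_all are made up for illustration, and $a_re stands in for $a (renamed only because $a is also Perl's special sort variable).

```perl
use strict;
use warnings;

# $a_re stands in for the node's $a; both patterns are the example values
# from the question.
my $a_re = qr{foo};
my $end  = qr{</code>};

# Attempt 1: the look-ahead is tested once, at the start of the string.
sub attempt_start_only {
    my ($text) = @_;
    return $text =~ m/^(?! $a_re ) .*? $end/sx ? 1 : 0;
}

# Attempt 2: the look-ahead scans the whole rest of the string.
sub attempt_scan_all {
    my ($text) = @_;
    return $text =~ m/^(?! .*? $a_re ) .*? $end/sx ? 1 : 0;
}

for my $text ('bar foo</code>', '...</code> ... foo') {
    printf "%-20s start-only: %d  scan-all: %d\n",
        $text, attempt_start_only($text), attempt_scan_all($text);
}
```

The first attempt accepts both strings, including the one where foo comes before </code>. The second attempt rejects both, including the one where foo only appears after </code>. Neither gives the wanted answer for both inputs.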

The answer to our problem is in the FAQs: we have to check at every position between the start of the string and the match of $end whether our assertion holds. tye pointed out (in the CB) that the solution isn't as hard as it seems:

m/^ (?: . (?! $a ) )* $end /sx

And of course tye is right - almost. The first position of the string is still unchecked:

$_ = 'foo...</code> ... ';
if (m{^(?: . (?! foo ))* </code>}sx) {
    print "Match\n";
}
# match

So the assertion has to be checked before the character is consumed:

m/^ (?: (?! $a ) . )* $end /sx

Now our assertion isn't checked after the last character before $end, but that's OK as long as $a cannot match at the start of what $end matches.
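As a quick check of this last pattern, here is a small sketch that runs it against the example strings used above, plus one case for the caveat just mentioned. The helper name match_before_end is hypothetical, and $a_re again stands in for $a.

```perl
use strict;
use warnings;

# Hypothetical helper: true if $end matches and $a_re does not start to
# match at any position before the start of that match.
sub match_before_end {
    my ($text, $a_re, $end) = @_;
    return $text =~ m/^ (?: (?! $a_re ) . )* $end /sx ? 1 : 0;
}

my $end = qr{</code>};

for my $text ('bar</code>', 'foo...</code> ... ',
              'bar foo</code>', '...</code> ... foo') {
    printf "%-20s %s\n", $text,
        match_before_end($text, qr{foo}, $end) ? 'match' : 'no match';
}

# The caveat: qr{</co} matches exactly where $end starts, a position the
# look-ahead never checks, so this is still reported as a match.
print match_before_end('bar</code>', qr{</co}, $end) ? "match\n" : "no match\n";
```

The strings with foo before </code> are rejected, while a foo after </code> no longer prevents the match. The last line shows the remaining blind spot: when $a can match at the start of what $end matches, the pattern does not notice it.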

(To be fair to tye, he later corrected his own mistake).

So if you are fairly new to regexes, don't be disappointed if look-arounds don't seem to work the way you expect them to - they are non-trivial.

If you are a regex guru by now, and look-arounds are nothing special to you, remember that for beginners they aren't easy, and be patient with them.