http://qs321.pair.com?node_id=782858

Using Look-ahead and Look-behind teaches you how to use look-around assertions in regular expressions. This node tells you why it's sometimes not so easy to use them correctly. So if you had troubles with them, know that you're not alone ;-).

Some time ago aaaone asked an interesting question that made me think, not only about the question itself but also a bit "meta". Phrased in my own words, the question was How can one construct a regex that matches a regex $end, but ensure that there's no match of regex $a before that. For example, how can I match text enclosed in <code>...</code> tags that doesn't contain the text foo? In this case we have $a = qr{foo} and $end = qr{</code>}

I thought about it a bit, and then I had a solution. Or so I thought. While typing my reply I stopped, thought harder, and realized that the problem is much more tricky than it seems.

Here are a few attempts how to do that with negative look-ahead assertions, the approach that our fellow monk suggested in his question:

We could start with a simple regex like this:

m/^(?! $a ) .*? $end/sx

When you think a bit more about it, you'll soon realize that it doesn't work - it only checks that $a doesn't occurs at the start of the string. So we have to search for places where it matches:

m/^(?! .*? $a ) .*? $end/sx

That looks much better, but it has a pitfall: It will prevent the regex from matching even if $aonly matches after $end matched:

$_ = '...</code> ... foo'; if (m{^(?! .*? foo ) .*? </code>}sx){ print "Match\n"; } # no match

The answer to our problem is in the FAQs: we have to check at every position between start and match of $end if our assertion holds. tye pointed out (in the CB) that the solution isn't as hard as it seems:

m/^ (?: . (?! $a ) )* $end /sx

And of of course tye is right - almost. The first position of the string is still unchecked:

$_ = 'foo...</code> ... '; if (m{^(?: . (?! .*? foo ))* </code>}sx){ print "Match\n"; } # match

So the assertion has to be checked before the character is consumed:

m/^ (?: (?! $a ) . )* $end /sx

Now our assertion isn't checked after the last character before $end, but that's ok if $a doesn't match the start of what $end matches.

(To be fair to tye, he later corrected his own mistake).

So if you are fairly new to regexes, don't be disappointed if look-arounds don't seem to work the way you expect them to - they are non-trivial.

If you are a regex guru by now, and look-arounds are nothing special to you, remember that for beginners they aren't easy, and be patient with them.

Replies are listed 'Best First'.
Re: Look-Arounds in Regexes are Hard (Common Use of a Negative Lookahead)
by ikegami (Patriarch) on Jul 24, 2009 at 08:42 UTC

    I view

    /(?:(?!$re).)*/s

    as the equivalent of

    /[^$chars]*/

    except it can prevent the match of entire regexps instead of a choice of characters. I call it the common use of a negative lookahead. If you're using a negative lookahead, that construct will probably satisfy your needs.

    Keep in mind that both expressions can successfully match 0 characters if not properly anchored.

    I think it would be nice if /(?^$re)/ could be added as a shortcut to /(?:(?!$re)(?s:.)*/. It would be more readable, and it would help prevent the common misuse of negative lookahead (/(?!$re.*)/).

    Update: Formatting.

      I really like that idea.

      There's just a minor point of bikeshedding: should it be (?s:.) instead of a simple dot?

      I can argue both ways: negated character classes include the newline (unless it's explicitly listed as excluded), and it becomes more predictable that way.

      On the other hand you gain more flexibility if respect the /s setting of the outer regex, allowing the caller to tweak its behaviour accordingly.

      The first point sounds more convincing to me, but I have to think a bit more about it.

        Fixed.
Re: Look-Arounds in Regexes are Hard
by JavaFan (Canon) on Jul 24, 2009 at 09:31 UTC
    I have found this can be a bit slow. Assuming $a doesn't start with a special regexp char, I prefer to do some loop unrolling (a technique explained in Mastering Regular Expressions (1st edition, never read the 2nd edition, I presume it to be there as well)). Say $a equals 'foo', I'd write it as:
    /^[^f]*(?:f(?!oo)[^f]*)*$end/
    This makes perl only do a lookahead on encountering an 'f' instead of any character, and reduces the need for the /s modifier.

    It may look more complex, but I've made a habit of unrolling regexp loops, and now it's almost second nature to me.

Re: Look-Arounds in Regexes are Hard
by papidave (Pilgrim) on Jul 27, 2009 at 20:57 UTC
    ++moritz for caring about beginners.

    I don't consider myself to be a newby any more, but even so, look-wherever expressions make me a bit crosseyed, and it got worse when I realized that the original question implied a fixed start and end to the pattern. When that happens, a line like

    Functions <code>abc()</code> and <code>foo()</code>
    should match the first code block, but not the second.

    If we want to use lookahead to match <code> blocks that don't contain foo, we might use something like

    m#<code>((?:(?!foo).)*)</code>#
    but you'd be wise to insert a ton of comments to clarify things*. OTOH, I find it far more readable to use something like the following:

    This gives us the desired multiple-block behaviors, while providing us with the non-foo data (in $1) for each instance, so we can do something more complicated with it if desired. In addition, the nested loop was surprisingly about 10% faster than the look-ahead code when I compared it over 100,000 iterations using Perl v5.10.0.

    *Unless you're in the UK; in that case, use a tonne of comments instead. :)

    Note: lost a slash in code cut/paste business - repaired.