
Using Look-ahead and Look-behind teaches you how to use look-around assertions in regular expressions. This node tells you why it's sometimes not so easy to use them correctly. So if you've had trouble with them, know that you're not alone ;-).

Some time ago aaaone asked an interesting question that made me think, not only about the question itself but also a bit "meta". Phrased in my own words, the question was: how can one construct a regex that matches a regex $end, but ensures that there's no match of regex $a before that? For example, how can I match text enclosed in <code>...</code> tags that doesn't contain the text foo? In this case we have $a = qr{foo} and $end = qr{</code>}.

I thought about it a bit, and then I had a solution. Or so I thought. While typing my reply I stopped, thought harder, and realized that the problem is much trickier than it seems.

Here are a few attempts at doing that with negative look-ahead assertions, the approach that our fellow monk suggested in his question:

We could start with a simple regex like this:

m/^(?! $a ) .*?  $end/sx
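
To see what this first attempt really checks, here is a small sketch using the values from the question ($a = qr{foo}, $end = qr{</code>}). The helper name first_attempt and the variable $a_re (used instead of $a, which Perl reserves for sort) are my own, not part of the original question.

```perl
use strict;
use warnings;

# Values taken from the example in the question; $a_re stands in for $a.
my $a_re = qr{foo};
my $end  = qr{</code>};

# Hypothetical helper wrapping the first attempt.
# Returns 1 if the pattern matches, 0 otherwise.
sub first_attempt {
    my ($text) = @_;
    return $text =~ m/^(?! $a_re ) .*? $end/sx ? 1 : 0;
}

for my $text ('foo bar</code>', 'bar foo</code>', 'bar</code>') {
    printf "%-16s %s\n", $text, first_attempt($text) ? 'match' : 'no match';
}
```

Only the first string is rejected. The second one matches although foo appears before </code>, because the look-ahead is tried at position 0 and nowhere else.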

When you think a bit more about it, you'll soon realize that it doesn't work - it only checks that $a doesn't occur at the start of the string. So we have to search for places where it matches:

m/^(?! .*? $a ) .*? $end/sx

That looks much better, but it has a pitfall: it will prevent the regex from matching even if $a only matches after $end has matched:

$_ = '...</code> ... foo';

if (m{^(?! .*? foo ) .*? </code>}sx){
    print "Match\n";
}
# no match

The answer to our problem is in the FAQs: we have to check at every position between start and match of $end if our assertion holds. tye pointed out (in the CB) that the solution isn't as hard as it seems:

m/^ (?: . (?! $a ) )* $end /sx

And of course tye is right - almost. The first position of the string is still unchecked:

$_ = 'foo...</code> ... ';

if (m{^(?: . (?! foo ) )* </code>}sx){
    print "Match\n";
}
# match

So the assertion has to be checked before the character is consumed:

m/^ (?: (?! $a ) . )* $end /sx

Now our assertion isn't enforced at the position where $end begins (that is, after the last character before $end), but that's ok as long as $a can't match starting at the same position as $end.
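
As a quick sanity check, here is a sketch that runs this last pattern against the test strings used above, plus one string with foo in the middle. The helper ends_without and the parameter name $a_re (standing in for $a) are hypothetical names of mine. The last call is a contrived edge case, again my own, where $a can match exactly where $end begins.

```perl
use strict;
use warnings;

# Hypothetical helper: applies the final pattern to any pair of regexes.
# Returns 1 on a match, 0 otherwise.
sub ends_without {
    my ($text, $a_re, $end) = @_;
    return $text =~ m/^ (?: (?! $a_re ) . )* $end /sx ? 1 : 0;
}

my $foo = qr{foo};
my $end = qr{</code>};

my @cases = (
    'bar</code>',          # no foo anywhere
    '...</code> ... foo',  # foo only after the end tag
    'foo...</code> ... ',  # foo at the very start
    'bar foo</code>',      # foo in the middle
);

for my $text (@cases) {
    printf "%-20s %s\n", $text,
        ends_without($text, $foo, $end) ? 'match' : 'no match';
}

# Contrived edge case: $a matches at the position where $end begins.
# The look-ahead only stops the loop there, it does not block $end.
printf "%-20s %s\n", 'edge case',
    ends_without('bar</code>', qr{</co}, $end) ? 'match' : 'no match';
```

The first two strings match and the next two don't, which is what the question asked for. The edge case matches as well, so if $a and $end can start at the same position you would need an extra (?! $a ) right before $end.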

(To be fair to tye, he later corrected his own mistake).

So if you are fairly new to regexes, don't be disappointed if look-arounds don't seem to work the way you expect them to - they are non-trivial.

If you are a regex guru by now, and look-arounds are nothing special to you, remember that for beginners they aren't easy, and be patient with them.

In reply to Look-Arounds in Regexes are Hard by moritz
