PerlMonks  


Using Look-ahead and Look-behind teaches you how to use look-around assertions in regular expressions. This node tells you why it's sometimes not so easy to use them correctly. So if you have had trouble with them, know that you're not alone ;-).

Some time ago aaaone asked an interesting question that made me think, not only about the question itself but also on a more "meta" level. Phrased in my own words, the question was: how can one construct a regex that matches a regex $end, while ensuring that there is no match of a regex $a before it? For example, how can I match text enclosed in <code>...</code> tags that doesn't contain the text foo? In this case we have $a = qr{foo} and $end = qr{</code>}.

I thought about it a bit, and then I had a solution. Or so I thought. While typing my reply I stopped, thought harder, and realized that the problem is much more tricky than it seems.

Here are a few attempts at doing that with negative look-ahead assertions, the approach that our fellow monk suggested in his question:

We could start with a simple regex like this:

m/^(?! $a ) .*? $end/sx
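To see what this pattern actually does, here is a minimal sketch that runs it against two sample strings. It assumes the values from the question; $a is renamed to $unwanted (a name chosen here, because $a doubles as Perl's special sort variable), and first_attempt is a hypothetical helper:

```perl
use strict;
use warnings;

# Assumed values from the question; $unwanted plays the role of $a.
my $unwanted = qr{foo};
my $end      = qr{</code>};

# First attempt: the look-ahead is evaluated only once, at position 0.
sub first_attempt {
    my ($text) = @_;
    return $text =~ m{^(?! $unwanted ) .*? $end}sx ? 1 : 0;
}

print first_attempt('bar foo baz</code>'), "\n";  # prints 1: foo in the middle goes unnoticed
print first_attempt('foo bar</code>'), "\n";      # prints 0: foo at the very start is caught
```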

When you think a bit more about it, you'll soon realize that it doesn't work - it only checks that $a doesn't occur at the start of the string. So we have to search for places where it matches:

m/^(?! .*? $a ) .*? $end/sx

That looks much better, but it has a pitfall: it will prevent the regex from matching even if $a only matches after $end has matched:

$_ = '...</code> ... foo';
if (m{^(?! .*? foo ) .*? </code>}sx) {
    print "Match\n";
}
# no match

The answer to our problem is in the FAQs: we have to check, at every position between the start of the string and the match of $end, whether our assertion holds. tye pointed out (in the CB) that the solution isn't as hard as it seems:

m/^ (?: . (?! $a ) )* $end /sx

And of course tye is right - almost. The first position of the string is still unchecked:

$_ = 'foo...</code> ... ';
if (m{^(?: . (?! foo ))* </code>}sx) {
    print "Match\n";
}
# match

So the assertion has to be checked before the character is consumed:

m/^ (?: (?! $a ) . )* $end /sx

Now our assertion isn't checked at the position where $end starts (that is, after the last character before $end), but that's ok as long as $a can't match at the start of what $end matches.
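The caveat about the position where $end starts can be illustrated with a deliberately contrived pair of patterns. This is only a sketch: the value of $unwanted (standing in for $a), the helper names, and the stricter variant with a second look-ahead are assumptions added here, not part of the original discussion:

```perl
use strict;
use warnings;

# Hypothetical pair: the unwanted pattern begins exactly where $end begins.
my $unwanted = qr{</code>!};
my $end      = qr{</code>};

# The pattern from the post: assertion checked before each consumed character.
sub final_pattern {
    my ($text) = @_;
    return $text =~ m{^ (?: (?! $unwanted ) . )* $end}sx ? 1 : 0;
}

# Assumed variant: repeat the assertion once more where $end begins.
sub stricter_pattern {
    my ($text) = @_;
    return $text =~ m{^ (?: (?! $unwanted ) . )* (?! $unwanted ) $end}sx ? 1 : 0;
}

print final_pattern('abc</code>!'), "\n";     # prints 1: position of $end is unchecked
print stricter_pattern('abc</code>!'), "\n";  # prints 0: extra look-ahead rejects it
print stricter_pattern('abc</code>'), "\n";   # prints 1: nothing unwanted here
```

Whether the extra look-ahead is wanted depends on whether a match of $a that begins together with $end should count as "before" it.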

(To be fair to tye, he later corrected his own mistake).
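As a summary, the four patterns discussed above can be run side by side against a few sample strings. This is a sketch under assumptions: the sample strings, the labels, the name $unwanted (standing in for $a) and the helper results_for are all made up for illustration:

```perl
use strict;
use warnings;

my $unwanted = qr{foo};       # $a in the post
my $end      = qr{</code>};

# The four patterns from the discussion, in order of appearance.
my @attempts = (
    [ 'look-ahead at start only',    qr{^(?! $unwanted ) .*? $end}sx ],
    [ 'look-ahead scans whole rest', qr{^(?! .*? $unwanted ) .*? $end}sx ],
    [ 'check after each char',       qr{^ (?: . (?! $unwanted ) )* $end}sx ],
    [ 'check before each char',      qr{^ (?: (?! $unwanted ) . )* $end}sx ],
);

# Sample inputs paired with the desired result:
# match (1) only if no foo occurs before </code>.
my @samples = (
    [ 'foo...</code> ...',  0 ],
    [ '... foo ...</code>', 0 ],
    [ '...</code> ... foo', 1 ],
    [ '... bar ...</code>', 1 ],
);

# Returns one digit per sample: 1 if the regex matched, 0 otherwise.
sub results_for {
    my ($re) = @_;
    return join '', map { $_->[0] =~ $re ? 1 : 0 } @samples;
}

my $wanted = join '', map { $_->[1] } @samples;
for my $attempt (@attempts) {
    my ($label, $re) = @$attempt;
    my $got = results_for($re);
    printf "%-28s %s %s\n", $label, $got, $got eq $wanted ? 'ok' : 'WRONG';
}
```

With these samples, only the last pattern agrees with the desired results on every line; each earlier one fails on exactly one sample, and it is the case described for it in the text.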

So if you are fairly new to regexes, don't be disappointed if look-arounds don't seem to work the way you expect them to - they are non-trivial.

If you are a regex guru by now, and look-arounds are nothing special to you, remember that for beginners they aren't easy, and be patient with them.


In reply to Look-Arounds in Regexes are Hard by moritz
