
I'm trying to use a negative lookahead assertion in a regex, and I don't understand what's going on with one form of the regex. I'm parsing snail mail addresses, particularly horrible beasts like "456 4 1/2 MILE RD". For this address, the street number is 456, and the street name is "4 1/2 MILE RD" (four and one-half mile road). If the word "MILE" is not there, then the 1/2 is generally treated as a unit number. (I'm not going to go into the complete details.) Suffice it to say that I'm looking to detect a 1/2 followed by something other than MILE. So, a first attempt at a regex is like this:

   if ($address =~ /(1\/2)\s*(?!MILE)/i) {
   ...

If $address = "4 1/2 MILE RD", this evaluates to true, but I don't see why. There is a "1/2", followed by whitespace. Then there is "MILE", so I'd expect the negative lookahead to fail and the whole match to evaluate to false.

If I change the regex so the whitespace is "\s+", then the match evaluates to false, as I expected. But both * and + are greedy, so all whitespace should be sucked up either way; the only difference should be that with * the whitespace is optional. And I will have cases where there isn't any whitespace.
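
To see what the engine does in each case, here is a minimal sketch that puts the whitespace in its own capture group and reports how much of it ended up in the match. The helper name probe and the extra capture group are additions for illustration only; they are not part of the regexes above.

```perl
use strict;
use warnings;

# Hypothetical helper: try a pattern against an address and report
# whether it matched and how many whitespace characters group 2 holds.
sub probe {
    my ($label, $re, $address) = @_;
    if ($address =~ $re) {
        return sprintf "%s: match, whitespace length %d", $label, length $2;
    }
    return "$label: no match";
}

my $address = "4 1/2 MILE RD";

# Same patterns as in the post, with the whitespace captured as $2.
print probe('\s*', qr/(1\/2)(\s*)(?!MILE)/i, $address), "\n";
print probe('\s+', qr/(1\/2)(\s+)(?!MILE)/i, $address), "\n";
```

The star version reports a match with a whitespace length of 0, and the plus version reports no match. That suggests \s* is allowed to give the space back before the lookahead is tried, so the lookahead is looking at " MILE" rather than "MILE".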

I used the following, which works like I want it to:

   if ($address =~ /(1\/2)(?!\s*MILE)/i) {
   ...
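
As a quick check of this form, here is a small sketch that runs it over a few addresses, including one with no whitespace between "1/2" and "MILE". The helper name half_is_unit and the sample addresses are made up for illustration; they are not from our real data.

```perl
use strict;
use warnings;

# Hypothetical helper: returns 1 when "1/2" is present and is not
# followed by optional whitespace plus the word MILE, otherwise 0.
sub half_is_unit {
    my ($address) = @_;
    return $address =~ /(1\/2)(?!\s*MILE)/i ? 1 : 0;
}

# Made-up sample addresses covering one space, no space, two spaces
# with lowercase, and a trailing 1/2 with no MILE at all.
for my $address ("4 1/2 MILE RD", "4 1/2MILE RD", "4 1/2  mile rd", "456 MAIN ST 1/2") {
    printf "%-18s => %d\n", $address, half_is_unit($address);
}
```

Only the last address, where no MILE follows the 1/2, comes back as 1. Because the \s* sits inside the lookahead, the whitespace is part of what is being ruled out, so giving it back cannot turn a failure into a match.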

Another programmer and I have absolutely no idea why the first regex doesn't work as planned, so any insight would be greatly appreciated.

BTW: this is a reduced case. Yes, we have other boundary conditions; I'm just perplexed about the regex behaviour.

In reply to regex negative lookahead behaviour by shemp
