regex negative lookahead behaviour

shemp has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
Re: regex negative lookahead behaviour by broquaint (Abbot) on Jul 18, 2003 at 17:14 UTC
This is due to the nature of `*`: while it is a greedy quantifier, it is also perfectly happy to match nothing at all. In this case `\s*` ends up matching no whitespace, since it doesn't have to, and the lookahead is then tested against " `MILE`" (with the leading space), which doesn't match "`MILE`", so the overall match succeeds. Hopefully this quick one-liner and its output make that explanation clearer:

    shell> perl -Mre=debug -e '$_="1/2 MILE"; m{ 1/2 \s* (?!MILE) }x'
    Freeing REx: `,'
    Compiling REx ` 1/2 \s* (?!MILE) '
    size 11 first at 1
       1: EXACT <1/2>(3)
       3: STAR(5)
       4:   SPACE(0)
       5: UNLESSM[-0](11)
       7:   EXACT <MILE>(9)
       9:   SUCCEED(0)
      10:   TAIL(11)
      11: END(0)
    anchored `1/2' at 0 (checking anchored) minlen 3
    Guessing start of match, REx ` 1/2 \s* (?!MILE) ' against `1/2 MILE'...
    Found anchored substr `1/2' at offset 0...
    Guessed: match at offset 0
    Matching REx ` 1/2 \s* (?!MILE) ' against `1/2 MILE'
      Setting an EVAL scope, savestack=3
       0 <> <1/2 MILE>        |  1:  EXACT <1/2>
       3 <1/2> < MILE>        |  3:  STAR
                               SPACE can match 1 times out of 32767...
      Setting an EVAL scope, savestack=3
       4 <1/2 > <MILE>        |  5:    UNLESSM[-0]
       4 <1/2 > <MILE>        |  7:      EXACT <MILE>
       8 <1/2 MILE> <>        |  9:      SUCCEED
                                  could match...
                                failed...
       3 <1/2> < MILE>        |  5:    UNLESSM[-0]
       3 <1/2> < MILE>        |  7:      EXACT <MILE>
                                  failed...
       3 <1/2> < MILE>        | 11:    END
    Match successful!
    Freeing REx: ` 1/2 \s* (?!MILE) '

HTH

broquaint
Re: regex negative lookahead behaviour by BrowserUk (Patriarch) on Jul 18, 2003 at 17:08 UTC
The problem is that whilst `\s*` is greedy, it will also match zero spaces. '1/2' followed by zero spaces is not followed by 'MILE'; it is followed by ' MILE' (with a leading space), which satisfies the negative lookahead, so the expression is true. The only thing that overrides a regex's natural greediness is its desire to achieve a match. If it can, it will, and it can, so it does :)

Examine what is said, not who speaks.
"Efficiency is intelligent laziness." -David Dunham
"When I'm working on a problem, I never think about beauty. I think only how to solve the problem. But when I have finished, if the solution is not beautiful, I know it is wrong." -Richard Buckminster Fuller
Re: Re: regex negative lookahead behaviour by pijll (Beadle) on Jul 19, 2003 at 11:49 UTC
"The only thing that overrides a regex's natural greediness is its desire to achieve a match."

Normal regexes are greedy, but not selfish; they are willing to give characters back if it's for the greater good. But you can use the `(?>pattern)` construct to create "selfish" regexes: they only try to match once at every position. If the pattern inside is greedy, it will never give up chars to make the entire regex match; if the pattern is lazy, it will never match more than the absolute minimum of chars at that position. So `"1/2 MILE" =~ m[1/2(?>\s*)(?!MILE)]` should work: the space stays locked inside the group, the lookahead sees `MILE`, and the match fails as intended.
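For anyone who wants to try the atomic group, here is a minimal sketch comparing it with the plain greedy pattern. The `%patterns` hash and the `half_matches` helper are names made up for this illustration; the two addresses are taken from elsewhere in the thread.

```perl
use strict;
use warnings;

# Plain greedy \s* versus the atomic group suggested above.
my %patterns = (
    greedy => qr{1/2\s*(?!MILE)},
    atomic => qr{1/2(?>\s*)(?!MILE)},
);

# Hypothetical helper: returns 1 if the named pattern matches, 0 otherwise.
sub half_matches {
    my ($name, $address) = @_;
    return $address =~ $patterns{$name} ? 1 : 0;
}

for my $address ('456 4 1/2 MILE RD', '36 1/2 Garfield St.') {
    for my $name (qw(greedy atomic)) {
        printf "%-7s %-22s %s\n", $name, "'$address'",
            half_matches($name, $address) ? 'match' : 'no match';
    }
}
```

The greedy pattern matches both addresses, because `\s*` hands its space back. The atomic version matches only the Garfield St. address.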
Re: regex negative lookahead behaviour by sauoq (Abbot) on Jul 18, 2003 at 17:15 UTC
You have to remember that the regex engine tries as hard as it can to make the match work.

`"4 1/2 MILE RD" =~ m!(1/2)\s*(?!MILE)!;`

First, the engine slurps up all the space after the "1/2" with the `\s*` and finds that it doesn't match because of your assertion. Next, it backtracks by slurping all but one space with the `\s*`. At that point, your assertion "MILE" doesn't match " MILE" (with a leading space), so it is successful.

-sauoq "My two cents aren't worth a dime.";
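The backtracking described above can be imitated by hand. This is only a rough sketch of the idea, not how the engine is actually implemented, and `trace_backtracking` is a name invented for the illustration: it tries the longest run of whitespace first, then gives characters back until the lookahead is satisfied.

```perl
use strict;
use warnings;

# Hand-rolled imitation of what the engine does with 1/2\s*(?!MILE).
# Returns a list of [chars_consumed, text_seen_by_lookahead, verdict].
sub trace_backtracking {
    my ($text) = @_;
    my ($spaces, $rest) = $text =~ m{1/2(\s*)(.*)}s or return;
    my @steps;
    for my $n (reverse 0 .. length $spaces) {
        # What the lookahead sees after \s* has consumed $n characters.
        my $ahead = substr($spaces, $n) . $rest;
        my $ok    = $ahead !~ /\AMILE/;
        push @steps, [ $n, $ahead, $ok ? 'accepted' : 'rejected' ];
        last if $ok;    # stop at the first attempt that succeeds
    }
    return @steps;
}

printf "\\s* takes %d char(s), lookahead sees '%s': %s\n", @$_
    for trace_backtracking('1/2 MILE');
```

For "1/2 MILE" the first attempt (one space consumed) is rejected and the second (nothing consumed) is accepted, which is the unwanted match.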
Re: regex negative lookahead behaviour by dws (Chancellor) on Jul 18, 2003 at 19:34 UTC
"I'm parsing snail mail addresses, particularly for horrible beasts like "456 4 1/2 MILE RD"."

Meta advice: I've been involved with two projects over the years that have had to parse names and addresses. In both cases, what we ended up with was a system that got ~98% right automatically, then kicked out the remaining 2% for vetting by a human. It's a diminishing returns problem: if you model the economics, at some point you have to stop pouring your effort into matching the remaining pool of difficult addresses, and let a human being do it.
Re: Re: regex negative lookahead behaviour by BazB (Priest) on Jul 20, 2003 at 11:19 UTC
Unfortunately, dws's approach isn't always possible. The system I work on handles significant volumes of addresses, using dedicated (commercial) software to handle the identification and validation of that information. There is in-house processing to help smooth things out, but not every problem can be catered for. The volumes involved prevent it being practical for humans to process the problem records (and in fact some of those problem records are a direct result of human input).

Even if the volumes were low enough for humans to be able to process exceptions, humans can't get it right all of the time. This might be because of a lack of information, poorly laid out information or just human error. As shemp describes, even reference data used in such validation systems isn't perfect, and this sets the upper limit to what you can reasonably expect to achieve. I'd say that you can never expect to get things 100% right, and it might end up being cheaper and/or easier to accept a certain error rate. Of course, your client may not accept this, but that's a whole other problem :-)

Cheers.

BazB

If the information in this post is inaccurate, or just plain wrong, don't just downvote - please post explaining what's wrong. That way everyone learns.
Re: regex negative lookahead behaviour by flounder99 (Friar) on Jul 18, 2003 at 18:51 UTC
I think your regex question was answered quite well but I wanted to remind you of some other hairy boundary conditions:

A friend of mine lived in a duplex when he was in college and his address was "36 1/2 Garfield St." So if he lived on "4 1/2 MILE RD" his address would have been "36 1/2 4 1/2 MILE RD" -- YIPES!!!

Is there a "1/2 MILE RD"? A duplex there would have addresses like "36 1/2 1/2 MILE RD" -- Double YIPES!!!

What if someone builds a duplex on "MILES RD"? What would your regex do with "36 1/2 MILES RD"? -- did I mention YIPES!!!

You probably only have a fixed number of records to process and my suggestions are irrelevant, but trying to process addresses in general is very hard.

-- flounder
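To see what a regex does with these cases, here is a small sketch that runs flounder99's addresses through pijll's atomic pattern from this thread. The second pattern adds `\b` after `MILE`; that is an assumption of this sketch rather than anything proposed in the thread, and `half_offsets` is a helper name made up for the illustration.

```perl
use strict;
use warnings;

# pijll's atomic pattern, plus a variant with \b added
# (the \b is this sketch's assumption, not part of the thread).
my $without_b = qr{1/2(?>\s*)(?!MILE)};
my $with_b    = qr{1/2(?>\s*)(?!MILE\b)};

# Returns the offsets at which $re finds a "1/2" it accepts.
sub half_offsets {
    my ($re, $address) = @_;
    my @offsets;
    push @offsets, $-[0] while $address =~ /$re/g;
    return @offsets;
}

for my $address (
    '36 1/2 Garfield St.',
    '36 1/2 4 1/2 MILE RD',
    '36 1/2 1/2 MILE RD',
    '36 1/2 MILES RD',
) {
    printf "%-24s without \\b: [%s]  with \\b: [%s]\n", "'$address'",
        join(',', half_offsets($without_b, $address)),
        join(',', half_offsets($with_b,    $address));
}
```

Both patterns accept only the first "1/2" (offset 3) in the first three addresses. They differ on "36 1/2 MILES RD": without `\b` the lookahead trips over the `MILE` in `MILES` and nothing is accepted, while with `\b` the "1/2" at offset 3 is accepted.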
Re: Re: regex negative lookahead behaviour by shemp (Deacon) on Jul 18, 2003 at 19:34 UTC
Yes, we have actually seen some pretty ugly stuff like you mention. Equally bad is "1/2 street" or even "1/8 street". I work for a private company that consolidates public record data for about 1/3 of Wisconsin (and growing). Unfortunately, addresses are quite ugly, and we need to build special cases for junk like this thread is about. We currently have a few thousand lines of Perl to parse addresses, and it's integrated with CASS-certified (U.S. postal certification) address correction software, but it's still a big pain.

The worst part about the whole deal is that the government raw data is sometimes truly ambiguous, such as "814 E main st". There's generally no good way to tell whether the "E" should be a unit or a street directional. So then we try both cases against the address correction software, and hopefully only one comes back as a match. Not nearly as bad as a few years ago, when I was maintaining legacy C++ code to try to do this!

Yes, we do only have a fixed number of records, currently about 3 million.
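The "try both cases" idea might be sketched roughly like this. Everything here is hypothetical: `looks_valid` is a toy stand-in for the address correction software (whose real interface isn't described in this thread), and `resolve_ambiguous` and the hash fields are names invented for the illustration.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the address correction software: it only
# "knows" one street. A real CASS lookup would replace this sub.
sub looks_valid {
    my ($candidate) = @_;
    return lc($candidate->{street}) eq 'main st'
        && $candidate->{directional} eq 'E';
}

# Build both readings of the ambiguous token, keep those that validate.
sub resolve_ambiguous {
    my ($number, $token, $street) = @_;
    my @candidates = (
        { number => $number, unit => $token, directional => '',     street => $street },
        { number => $number, unit => '',     directional => $token, street => $street },
    );
    return grep { looks_valid($_) } @candidates;
}

my @hits = resolve_ambiguous('814', 'E', 'main st');
my $verdict = @hits == 1
    ? "one reading validated: directional '$hits[0]{directional}'"
    : 'still ambiguous, needs review';
print "$verdict\n";
```

If zero or two readings come back, the record is still ambiguous and would have to be handled some other way.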
Re: regex negative lookahead behaviour by jonadab (Parson) on Jul 19, 2003 at 13:46 UTC
This specific case has been answered for you quite adequately, but I just wanted to point out that you could have figured this out by looking at exactly WHAT the regex was matching. If you'd written a test script to quote and output what was matched, you'd have seen that the space wasn't matched, which might have led you to the answer faster than waiting for the monks to reply.

Quidquid latine dictum sit altum viditur.
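A test script along those lines could be as small as the following sketch. The pattern is the one from broquaint's one-liner with a capture group wrapped around it; `show_match` is a name made up here, and the brackets are just one way to make whitespace visible.

```perl
use strict;
use warnings;

# Returns the matched text wrapped in brackets, or undef on no match.
sub show_match {
    my ($text) = @_;
    return $text =~ m{(1/2\s*(?!MILE))} ? "[$1]" : undef;
}

for my $text ('1/2 MILE', '1/2 MAIN ST') {
    my $shown = show_match($text);
    printf "%-14s => %s\n", "'$text'", defined $shown ? $shown : 'no match';
}
```

For "1/2 MILE" this shows `[1/2]` with no trailing space inside the brackets, which is the giveaway; for "1/2 MAIN ST" it shows `[1/2 ]`.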
Re: regex negative lookahead behaviour by princepawn (Parson) on Jul 18, 2003 at 18:10 UTC
"The only thing that overrides a regex's natural greediness is its desire to achieve a match. If it can, it will, and it can, so it does :)"

How about putting a `+` after the quantifier?

Carter's compass: I know I'm on the right track when by deleting something, I'm adding functionality
Re: Re: regex negative lookahead behaviour by sauoq (Abbot) on Jul 18, 2003 at 18:59 UTC
"How about putting a `+` after the quantifier?"

Then it breaks as soon as someone typos an extra space into the address. So, you can only do that if you can assume spaces have been normalized and there will be only one... But, if you can make that assumption, then there'd be no reason for a quantifier at all. :-)

-sauoq "My two cents aren't worth a dime.";
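A quick sketch of that failure mode, reading the suggestion the way this reply does, as `\s+` in place of `\s*`. The `plus_matches` helper is a name made up for this illustration.

```perl
use strict;
use warnings;

# \s+ in place of \s*: requires at least one whitespace character.
my $plus = qr{1/2\s+(?!MILE)};

# Hypothetical helper: returns 1 if the pattern matches, 0 otherwise.
sub plus_matches { return $_[0] =~ $plus ? 1 : 0 }

printf "%-12s %s\n", "'$_'", plus_matches($_) ? 'match' : 'no match'
    for '1/2 MILE', '1/2  MILE';
```

With a single space there is no match, as wanted. With two spaces, `\s+` gives one back, the lookahead sees " MILE", and the unwanted match returns.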