Beefy Boxes and Bandwidth Generously Provided by pair Networks
Perl Monk, Perl Meditation
 
PerlMonks  

Regex confusion

by scottb (Scribe)
on Dec 24, 2004 at 20:58 UTC ( [id://417348]=perlquestion: print w/replies, xml ) Need Help??

scottb has asked for the wisdom of the Perl Monks concerning the following question:

Hi, I'm writing some rather large regex's and I've run into a bit of an interesting situation. Through testing I'm able to figure out how to accomplish what I want, but I still don't understand why the regex behaves the way it does. So I'm searching for that answer. I am evaluating the data with regex's compiled with //msgx. Here is some test data I created for this example:
int A Number of Flaps: 6 IP Address: 2.2.2.2 int B Number of Flaps: 8 IP Address 5.5.5.5 int C Number of Flaps: 9 IP Address 9.9.9.9
Now I need for data for each 'int' to be optional, that is the line doesn't have to exist at all and in which case the array returned by evaluating the regex will contain an undef value in it's place. In other words, an interface doesn't neccesarily have to have a line dictating the number of flaps, and doesn't neccesarily have to have an IP address. I'd therefore come up with the following regex:
^int\s+(\w+)\s* (?:^\s*Number\sof\sFlaps:\s(\d+)\s*)? (?:^\s*IP\sAddress:?\s(\S*)\s*)?
However, this doesn't find either of the flaps or the IP address. Interestingly though, in order to 'fix' the regular expression so that it does match, I merely need to remove the caret indicating that the optional component needs to be at the beginning of the line, and it matches, like so:
^int\s+(\w+)\s* (?:\s*Number\sof\sFlaps:\s(\d+)\s*)? (?:\s*IP\sAddress:?\s(\S*)\s*)?
That works. With that, I find all three interfaces and their flaps and IP address. Another way I can make the regex find everything in this example (although it doesn't work for my real data in which the components are optional), is to remove the (?: )? clustering and force it to match that way:
^int\s+(\w+)\s* ^\s*Number\sof\sFlaps:\s(\d+)\s* ^\s*IP\sAddress:?\s(\S*)\s*
Not really relivent in my situation, but interesting while I am attempting to understand the problem.

So the question is, what is it about the caret in my original regex that breaks the matching and makes the optional components not greedy? What do I mean by "not greedy"? Well, consider the following regex with a zero-width positive look-ahead assertion at the end to "pull down" the option components and try to force them to match:

^int\s+(\w+)\s* (?:^\s*Number\sof\sFlaps:\s(\d+)\s*)? (?:^\s*IP\sAddress:?\s(\S*)\s*)? (?=^int|\Z)
That works too. It's the same regular expression before the zero-width positive look-ahead assertion. It surprises and stupifies me that it doesn't work without it because I would expect the optional components to try to match if they can. I mean it's not as if I had used (?: )?? to make it less greedy, but it seems to be behaving that way.

Anyone who can clue me in would be greatly appreciated.
Happy Holidays!

- Scott

Replies are listed 'Best First'.
Re: Regex confusion
by MarkusLaker (Beadle) on Dec 24, 2004 at 21:54 UTC
    I wish I could remember who originally said this: regular expressions may be greedy, but they are not into deferred gratification.

    Let's consider what happens when we apply this regex --

    ^int\s+(\w+)\s* (?:^\s*Number\sof\sFlaps:\s(\d+)\s*)? (?:^\s*IP\sAddress:?\s(\S*)\s*)?

    -- to this string:

    int A Number of Flaps: 6 IP Address: 2.2.2.2

    The \s* in the first line of the regex eats up any spaces at the end of the first line of input text, plus the newline, plus the indent at the start of the second line. Since the rest of the regex is inside (?: ... )? brackets, the regex can match successfully without backtracking, and that's exactly what it does -- without matching the rest of the input string.

    Now let's remove the question marks at the end of the second and third lines of the regex. This forces the regex engine to match the text inside the brackets if the match is to succeed. This in turn forces the regex engine to backtrack when it tries to match that first \s*, so that the \s* matches any trailing space at the end of the first line of text, and one newline, and nothing more. This leaves the ^ and the rest of the second and third lines of the regex to match successfully.

    To solve your problem -- to make the last two lines of data optional -- you need to add newlines to the regex, like this:

    ^int\s+(\w+)\s*\n (?:^\s*Number\sof\sFlaps:\s(\d+)\s*\n)? (?:^\s*IP\sAddress:?\s(\S*)\s*\n)?

    In each case, the newline stops the preceding \s* from slurping up the indenting whitespace at the start of the next line. This makes it possible for the text inside the brackets to match without backtracking.

    Like most things related to regexes, this is hard to explain. The principle I want to get across is that every term followed by a star will match greedily (i.e. as many times as possible); and, if Perl can make the regex match without backtracking, it will. This is what was happening with your original version of the regex. The regex engine is prepared to backtrack in order to make the whole regex match, but not in order to match optional items. What the added newlines do in our example is to make sure that the regex engine has reached a point in the input text where the optional brackets can match without the need for backtracking.

    I don't feel I've explained this very well, so feel free to come back with questions about it -- but I'll be off the Net for a few days, so I can't reply promptly.

    Merry Christmas to all.

    Markus

      Thank you very much! It makes perfect sense. Actually to be honest I feel silly now that you've explained it to me as you have because it's the kind of thing I would have expected myself to catch.

      Have a great Christmas, and thanks again.
      - Scott

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: perlquestion [id://417348]
Approved by Old_Gray_Bear
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others browsing the Monastery: (4)
As of 2024-04-25 16:05 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found