http://qs321.pair.com?node_id=372786

stijnvs has asked for the wisdom of the Perl Monks concerning the following question:

Hello monks,

The perlre man page states:

By default, a quantified subpattern is "greedy", that is, it will match as many times as possible (given a particular starting location) while still allowing the rest of the pattern to match.

Hence, when trying the following simple program:

"abc" =~ /(a|ab)*(bc|c)/; print "$1 $2\n";

One would expect the output to be ab c. Executing the above program with Perl version 5.8.3 returns a bc however.

My question now is: which one is wrong, me, the man page or the implementation?

Replies are listed 'Best First'.
Re: Greedy regexp matching: differences between man page and implementation?
by Corion (Patriarch) on Jul 08, 2004 at 13:27 UTC

    Both, the documentation and the implementation are correct, because any regex match favours the leftmost interpretation. If you reorder your alternatives in your capturing first parentheses, I expect you will see the other match ab c, but as the single-letter a is already enough to make the first parentheses match once, the alternation won't be tried at all, as the overall match can succeed that way.

    With Perls regular expression engine, you will always get the longest leftmost match, with "leftmost" taking precedence over "longest", which is what is biting you in the first capture.

      This is an even more obvious:
      $_='abcdefgh'; /^(a|abcdefg)(h|bc)/; print "[$1] [$2]\n";
      It prints [a] [bc]. The regexp engine clearly doesn't run through all possible combinations within (..|..|..) patterns.
Re: Greedy regexp matching: differences between man page and implementation?
by Roy Johnson (Monsignor) on Jul 08, 2004 at 13:45 UTC
    Be sure to see this meditation for better understanding of how it's supposed to (and does) work.

    In your example, it did match as many times as possible: once. Matching ab instead of a would not be matching more times, just more characters.


    We're not really tightening our belts, it just feels that way because we're getting fatter.
Re: Greedy regexp matching: differences between man page and implementation?
by japhy (Canon) on Jul 08, 2004 at 13:48 UTC
    As Hugo van der Sande wrote less than a week ago, regexes execute their alternations in the order you provide them. Thus, the "a" alternation is attempted before the "ab" one.
    _____________________________________________________
    Jeff japhy Pinyan, P.L., P.M., P.O.D, X.S.: Perl, regex, and perl hacker
    How can we ever be the sold short or the cheated, we who for every service have long ago been overpaid? ~~ Meister Eckhart
Re: Greedy regexp matching: differences between man page and implementation?
by davido (Cardinal) on Jul 08, 2004 at 16:02 UTC
    You might try it like this:
    /(ab?)*(b?c)/
    The output becomes: ab c.

    Dave