http://qs321.pair.com?node_id=1177010

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Hi Monks,

I have been playing with regular expressions, and Perl does something different from what I would expect. Only the "+" quantifier worked as expected. When "*" or "?" is at the end of an expression, it seems it is sometimes not greedy. Please see my example:

    use strict;
    use warnings;

    $_ = "kkkaaabc";
    print "String = $_\n";
    print "1 $& Expected: ka ?? /k?a?/\n" if /k?a?/;
    print "2 $& Expected: kkka OK\n" if /k*a?/;
    print "3 $& Expected: kaaa OK\n" if /k?a+/;
    print "4 $& Expected: kkkaaa OK\n" if /k+a+/;
    print "5 $& Expected: kaaa ?? /k?a*/\n" if /k?a*/;
    print "6 $& Expected: kkkaaa OK\n" if /k*a*/;

The result is:

    String = kkkaaabc
    1 k Expected: ka ?? /k?a?/
    2 kkka Expected: kkka OK
    3 kaaa Expected: kaaa OK
    4 kkkaaa Expected: kkkaaa OK
    5 k Expected: kaaa ?? /k?a*/
    6 kkkaaa Expected: kkkaaa OK

Why do I get no "a" at the end of the match in cases 1 and 5? Many thanks for any help!

Replies are listed 'Best First'.
Re: basic question: regular expression
by Ratazong (Monsignor) on Dec 01, 2016 at 07:34 UTC

    Hi!

    The reason is that * and ? can also match zero times. In your example 1, the regex matches the first k and then zero a's, because the next character is another k. That is already a successful match, so the engine stops there, even though a longer sequence comes later in the string.

    Once a match has started, the regex tries to be as greedy as possible; that's why you get kkka in example 2.
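
A minimal sketch of the two cases, using the string from the question. The variable names ($opt, $star) are only illustrative:

```perl
use strict;
use warnings;

my $s = 'kkkaaabc';

# /k?a?/ starts at offset 0: k? takes the first 'k', then a? looks at the
# second 'k', finds no 'a', and settles for zero repetitions.
my ($opt) = $s =~ /(k?a?)/;
print "k?a? matched '$opt'\n";

# /k*a?/ also starts at offset 0, but k* greedily takes all three 'k's,
# so a? is now looking at an 'a' and can take it.
my ($star) = $s =~ /(k*a?)/;
print "k*a? matched '$star'\n";
```

Both matches begin at the first character; only what the first quantifier consumes differs.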

    HTH, Rata

Re: basic question: regular expression
by BrowserUk (Patriarch) on Dec 01, 2016 at 08:28 UTC
    Why do I get in case 1 and 5 no "a" at the end of the matching circuit ??.
    1. You asked for /k?a?/: optionally 'k', optionally followed by 'a'.

      The optional 'k' was satisfied by the first character. The next character is another 'k', not an 'a', and since the 'a' is optional the match can STOP here.

    2. You asked for /k?a*/: optionally 'k', followed (or not) by zero or more 'a's.

      Found a 'k'; the next character is not an 'a', and everything after the 'k' is optional, so STOP here.
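
One way to see this for yourself is to put each quantified piece in its own capture group and print what it consumed. This is only a sketch; the capture groups and the @parts array are additions for illustration and are not part of the original patterns:

```perl
use strict;
use warnings;

my $s = 'kkkaaabc';
my @parts;

for my $rx ('(k?)(a?)', '(k?)(a*)') {
    if ($s =~ /$rx/) {
        # $1 is what the k-quantifier took, $2 what the a-quantifier took,
        # and $+[0] is the offset just past the end of the whole match.
        push @parts, [ $rx, $1, $2, $+[0] ];
        printf "/%s/ k-part='%s' a-part='%s' ends at offset %d\n",
            $rx, $1, $2, $+[0];
    }
}
```

In both cases the a-part comes back as the empty string: the match succeeded, it just used zero repetitions.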


Re: basic question: regular expression
by Laurent_R (Canon) on Dec 01, 2016 at 08:51 UTC
    The regex engine will not backtrack to find a "better" (i.e. longer) match; it will backtrack only as long as there is no successful match. Greediness applies only within the context of a given match.

    In your two examples, the regex engine could find a successful match using the first "k" of your string. In such a case, it will simply report success and will not try to get a longer match by starting again from a later "k".
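
If you really do want the longest match rather than the first one, you have to ask for it yourself. The following is only a sketch of one possible approach (nobody in this thread proposed it): collect the successive matches with /g and keep the longest. Note that /g returns non-overlapping matches, so this is not an exhaustive search over every starting position.

```perl
use strict;
use warnings;

my $s = 'kkkaaabc';

# A plain match stops at the first starting position that succeeds.
my ($first) = $s =~ /(k?a*)/;

# With /g in list context the engine continues after each match,
# so we can walk over the successive matches and keep the longest.
my $longest = '';
for my $m ($s =~ /(k?a*)/g) {
    $longest = $m if length($m) > length($longest);
}

print "first match:   '$first'\n";
print "longest match: '$longest'\n";
```

For this particular string, simply requiring at least one 'a' (/k?a+/, case 3 in the question) gives the same 'kaaa' with much less effort.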

Re: basic question: regular expression
by AnomalousMonk (Archbishop) on Dec 01, 2016 at 13:50 UTC

    It's sometimes useful to know where in a string a match occurs.

    c:\@Work\Perl\monks>perl -wMstrict -le
    "my $s = 'kkkaaabc';
     for my $rx (
         'k?a?', 'k*a?', 'k?a+', 'k+a+', 'k?a*', 'k*a*', 'a*',
         ) {
         print qq{'$s'};
         ;;
         $s =~ m{ ($rx) }xms;
         ;;
         if (not defined $1) {
             print 'no match';
             next;
             }
         ;;
         print ' ', ' ' x $-[1], '^' x ($+[1] - $-[1]),
             qq{ /$rx/ matched '$1' at offset $-[1]};
         }
    "
    'kkkaaabc'
     ^ /k?a?/ matched 'k' at offset 0
    'kkkaaabc'
     ^^^^ /k*a?/ matched 'kkka' at offset 0
    'kkkaaabc'
       ^^^^ /k?a+/ matched 'kaaa' at offset 2
    'kkkaaabc'
     ^^^^^^ /k+a+/ matched 'kkkaaa' at offset 0
    'kkkaaabc'
     ^ /k?a*/ matched 'k' at offset 0
    'kkkaaabc'
     ^^^^^^ /k*a*/ matched 'kkkaaa' at offset 0
    'kkkaaabc'
      /a*/ matched '' at offset 0

    Note in particular the /a*/ regex which I added at the end. This matches the empty string at offset 0 even though a perfectly good 'aaa' sequence is available further on. Engrave "Leftmost Longest" on a prayer wheel and keep it ever spinning in your mind: the leftmost match wins first, and greediness only lengthens it from there.
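
A short sketch contrasting /a*/ with /a+/ on the same string, using the @- and @+ offset arrays as above. The %found hash is just an illustrative way to keep the results:

```perl
use strict;
use warnings;

my $s = 'kkkaaabc';

my %found;
for my $rx ('a*', 'a+') {
    if ($s =~ /$rx/) {
        # $-[0] and $+[0] are the start and end offsets of the whole match.
        $found{$rx} = [ $-[0], substr($s, $-[0], $+[0] - $-[0]) ];
        print "/$rx/ matched '$found{$rx}[1]' at offset $found{$rx}[0]\n";
    }
}
```

/a*/ is satisfied with nothing at offset 0, while /a+/ cannot succeed until the engine has advanced to the first 'a' at offset 3, where it then greedily takes all three.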

    Regexes are the most counterintuitive thing I've encountered in the realm of programming.

    Updates:

    1. The initialization of capture variables (e.g., $1) to undef on regex recompilation is apparently only available from Perl version 5.10 onward. The code
      my $match = $s =~ m{ ($rx) }xms; if (not $match) { print 'no match'; next; }
      is more portable among different Perl versions.
    2. Also check out davido's Perl Regular Expression Tester

