http://qs321.pair.com?node_id=701618

rovf has asked for the wisdom of the Perl Monks concerning the following question:

I wanted to substitute the extension of a filename by 'txt', that is to transform abc/xyz.something into abc/xyz.txt. This is easy, but maybe driven by the sudden thought that I'm becoming an old man as the years passed by without ever having tried a positive look behind regexp, I came up with the following silly solution:

use strict;
use warnings;
my $fn="abc/def.xyz";
$fn =~ s/($<=[.])[^.]*$/txt/;
print "$fn\n";
That is, substitute the longest string at the end of the filename which does not contain a period, but is preceded by one. Interestingly, this did not work: no substitution took place.

In my case this is overkill, because I happen to know that my filename *does* contain a period, so I could much more easily have written

$fn =~ s/[^.]*$/txt/;
(this would however replace the complete filename with txt if it doesn't contain a period).
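A minimal sketch of that caveat, assuming a hypothetical helper named naive_ext (not part of the original post) that wraps the simpler substitution:

```perl
use strict;
use warnings;

# naive_ext is a hypothetical helper wrapping the simpler substitution.
sub naive_ext {
    my ($fn) = @_;
    # Replace the trailing run of non-period characters with 'txt'.
    $fn =~ s/[^.]*$/txt/;
    return $fn;
}

print naive_ext("abc/def.xyz"), "\n";   # prints abc/def.txt
print naive_ext("abcdef"), "\n";        # prints txt (no period, so the whole name is replaced)
```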

Nevertheless I would like to know *why* my original solution has failed. Any suggestions?

-- 
Ronald Fischer <ynnor@mm.st>

Replies are listed 'Best First'.
Re: positive look behind regexp mystery
by moritz (Cardinal) on Aug 01, 2008 at 08:41 UTC
    The syntax for look-behinds is (?<=...), not ($<=...).
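A short sketch comparing the two spellings. The likely reason the typo compiles silently, as far as I can tell, is that $< is Perl's real-user-ID variable and gets interpolated into the pattern, so ($<=[.]) becomes an ordinary capture group that looks for something like "1000=." in the string. The variable names below are only for illustration.

```perl
use strict;
use warnings;

my $fn = "abc/def.xyz";

# Typo version: $< (real user ID) is interpolated, so this is a plain
# capture group matching "<uid>=." and not a look-behind at all.
(my $typo = $fn) =~ s/($<=[.])[^.]*$/txt/;

# Corrected version: (?<=[.]) is a positive look-behind for a period.
(my $fixed = $fn) =~ s/(?<=[.])[^.]*$/txt/;

# Show what the typo pattern actually compiles to (depends on your user ID).
print "typo pattern: ", qr/($<=[.])[^.]*$/, "\n";
print "typo:  $typo\n";    # unchanged: abc/def.xyz
print "fixed: $fixed\n";   # abc/def.txt
```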
      The syntax for look-behinds is (?<=...), not ($<=...)

      :-O
      I definitely *wanted* to type a '?', and when I was *looking* at the code, my brain was always telling me that there is a question mark! I simply did not see the typo!!!! Thanks a lot!

      -- 
      Ronald Fischer <ynnor@mm.st>
Re: positive look behind regexp mystery (\K assertion)
by lodin (Hermit) on Aug 14, 2008 at 12:54 UTC

    This is a great case for the \K assertion (update: forgot to mention that \K is new for 5.10 but available to "everyone" via Regexp::Keep by Jeff Pinyan, who came up with the idea (I don't know if that will provide you the same efficiency though)). Not only is it easier, but it's also more efficient due to the optimizations of the regexp engine. The pattern would look like this:

    s/\.\K[^.]*$/txt/;
    The great part with this is that the engine can start looking for a literal (the dot) and avoid a lot of backtracking. The output of use re 'debug'; will visualize this.

    With the look-behind pattern, you see there's a lot of backtracking going on, and the engine guesses a match at the beginning of the string (the string is "xyz.foo" in the examples below).

    Compiling REx "(?<=[.])[^.]*$"
    Final program:
       1: IFMATCH[-1] (7)
       3:   EXACT <.> (5)
       5:   SUCCEED (0)
       6: TAIL (7)
       7: STAR (19)
       8:   ANYOF[\0-\-/-\377{unicode_all}] (0)
      19: EOL (20)
      20: END (0)
    floating ""$ at 0..2147483647 (checking floating) minlen 0
    Guessing start of match in sv for REx "(?<=[.])[^.]*$" against "xyz.foo"
    Found floating substr ""$ at offset 7...
    Guessed: match at offset 0
    Matching REx "(?<=[.])[^.]*$" against "xyz.foo"
       0 <> <xyz.foo>            |  1:IFMATCH[-1](7)
                                    failed...
       1 <x> <yz.foo>            |  1:IFMATCH[-1](7)
       0 <> <xyz.foo>            |  3:  EXACT <.>(5)
                                      failed...
                                    failed...
       2 <xy> <z.foo>            |  1:IFMATCH[-1](7)
       1 <x> <yz.foo>            |  3:  EXACT <.>(5)
                                      failed...
                                    failed...
       3 <xyz> <.foo>            |  1:IFMATCH[-1](7)
       2 <xy> <z.foo>            |  3:  EXACT <.>(5)
                                      failed...
                                    failed...
       4 <xyz.> <foo>            |  1:IFMATCH[-1](7)
       3 <xyz> <.foo>            |  3:  EXACT <.>(5)
       4 <xyz.> <foo>            |  5:  SUCCEED(0)
                                      subpattern success...
       4 <xyz.> <foo>            |  7:STAR(19)
                                    ANYOF[\0-\-/-\377{unicode_all}] can match 3 times out of 2147483647...
       7 <xyz.foo> <>            | 19:  EOL(20)
       7 <xyz.foo> <>            | 20:  END(0)
    Match successful!
    However, if we look at the \K pattern, we get this:
    Compiling REx "\.\K[^.]*$"
    Final program:
       1: EXACT <.> (3)
       3: KEEPS (4)
       4: STAR (16)
       5:   ANYOF[\0-\-/-\377{unicode_all}] (0)
      16: EOL (17)
      17: END (0)
    anchored "." at 0 floating ""$ at 1..2147483647 (checking anchored) minlen 1
    Guessing start of match in sv for REx "\.\K[^.]*$" against "xyz.foo"
    Found anchored substr "." at offset 3...
    Found floating substr ""$ at offset 7...
    Starting position does not contradict /^/m...
    Guessed: match at offset 3
    Matching REx "\.\K[^.]*$" against ".foo"
       3 <xyz> <.foo>            |  1:EXACT <.>(3)
       4 <xyz.> <foo>            |  3:KEEPS(4)
       4 <xyz.> <foo>            |  4:STAR(16)
                                    ANYOF[\0-\-/-\377{unicode_all}] can match 3 times out of 2147483647...
       7 <xyz.foo> <>            | 16:  EOL(17)
       7 <xyz.foo> <>            | 17:  END(0)
    Match successful!
    That's nice. No backtracking.
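For comparison, a small sketch showing that the look-behind and \K forms give the same result on a few sample names. It assumes Perl 5.10 or later for \K; the helper names ext_via_lookbehind and ext_via_keep are hypothetical and only serve the illustration.

```perl
use strict;
use warnings;
use 5.010;   # \K needs Perl 5.10 or later

# Hypothetical helper: look-behind form of the substitution.
sub ext_via_lookbehind {
    my ($fn) = @_;
    $fn =~ s/(?<=[.])[^.]*$/txt/;
    return $fn;
}

# Hypothetical helper: \K form; the matched period is kept, only the
# text after \K is replaced.
sub ext_via_keep {
    my ($fn) = @_;
    $fn =~ s/\.\K[^.]*$/txt/;
    return $fn;
}

for my $fn ("abc/def.xyz", "xyz.foo", "a.b.c", "noextension") {
    printf "%-12s lookbehind=%-12s keep=%s\n",
        $fn, ext_via_lookbehind($fn), ext_via_keep($fn);
}
```

Unlike the plain s/[^.]*$/txt/ variant, both forms leave a name without a period untouched.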

    lodin

      This is a great case for the \K assertion.

      I have never heard of \K and can't find it in perlre. Is this a very new feature?

      -- 
      Ronald Fischer <ynnor@mm.st>
        It was added in 5.10.0.
        The doc you linked to (perlre) has tons of references to \K. :-) You, like me, must be still at perl5.8, and we don't have it in our docs. But if you are in 5.10, then read again :-)
        []s, HTH, Massa (κς,πμ,πλ)
Re: positive look behind regexp mystery
by BrowserUk (Patriarch) on Aug 01, 2008 at 08:48 UTC
Re: positive look behind regexp mystery
by Anonymous Monk on Aug 01, 2008 at 08:32 UTC
    see use re 'debug';
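A minimal sketch of how that might look, assuming Perl 5.10 or later, where use re 'debug' is lexically scoped. The sub name swap_ext_debug is hypothetical. The compile and match trace is written to STDERR; the exact trace text varies between Perl versions.

```perl
use strict;
use warnings;

# Hypothetical helper: only the regex compiled inside this sub is traced,
# because the pragma is limited to the enclosing lexical scope.
sub swap_ext_debug {
    my ($fn) = @_;
    use re 'debug';
    $fn =~ s/(?<=[.])[^.]*$/txt/;
    return $fn;
}

# The trace goes to STDERR; the result goes to STDOUT.
print swap_ext_debug("xyz.foo"), "\n";
```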