Beefy Boxes and Bandwidth Generously Provided by pair Networks
Just another Perl shrine
 
PerlMonks  

Why is variable length lookahead implemented while lookbehind is not?

by Kc12349 (Monk)
on Sep 07, 2011 at 19:02 UTC ( [id://924643]=perlquestion: print w/replies, xml ) Need Help??

Kc12349 has asked for the wisdom of the Perl Monks concerning the following question:

This is just a curiosity, but why is variable length lookbehind not implemented while variable length lookahead is? Is it because there is no logical need for it, i.e. there is a better way to write a regex that uses it? Or is due to some limitation of how regex can be implemented in perl?

Admittedly, each time I have run into the error writing a quick regex, there has indeed been a better way to write the pattern. To that end, a second question would be has anyone run into a use case where variable length lookbehind would be clearly useful over alternative patterns?

The below is a useless piece of code save only to generate the error.

my $string = 'aaaabcccc'; $string =~ s/(?<=a+)b(?=c+)/B/; say $string; # Variable length lookbehind not implemented in regex m/(?<=a+)b(?=c+) +/ at C:\Projects\test.pl line 15.
  • Comment on Why is variable length lookahead implemented while lookbehind is not?
  • Download Code

Replies are listed 'Best First'.
Re: Why is variable length lookahead implemented while lookbehind is not?
by davido (Cardinal) on Sep 07, 2011 at 19:36 UTC

    The regex engine won't backtrack to validate a variable width assertion, and that's what would be necessary to make variable width work, from what I understand.

    perlre suggests a workaround with \K, and I've seen people suggest that one of the most effective variable width lookbehinds is reverse (process the reversed string instead).


    Dave

      Thanks. That's very helpful. \K fills any need I occasionally come upon for variable lookbehind.

Re: Why is variable length lookahead implemented while lookbehind is not?
by Jim (Curate) on Sep 07, 2011 at 19:42 UTC

    In Important Notes About Lookbehind, Jan Goyvaerts explains it this way:

    The bad news is that most regex flavors do not allow you to use just any regex inside a lookbehind, because they cannot apply a regular expression backwards. Therefore, the regular expression engine needs to be able to figure out how many steps to step back before checking the lookbehind.

    In modern versions of Perl, variable-length look-behind is effectively supported using the \K pattern ("K" for "keep"). See perlre.

    #!perl use strict; use warnings; use feature qw( say ); my $string = 'aaaabcccc'; $string =~ s/a+\Kb(?=c+)/B/; say $string; __END__ aaaaBcccc

      Correct me if I'm wrong, but doesn't the  \K special escape emulate only the positive look-behind assertion? This is certainly suggested by perlre, in which the use of  \K for variable-length look-behind is discussed only in the context of the  (?<=pattern) assertion.

      Update: However, the  (*SKIP)(*FAIL) pair of Special Backtracking Control Verbs (5.10+) can be pressed into service for negative look-behind:

      >perl -wMstrict -le "my $s = 'aaxbbbxccccxddxex'; ;; printf qq{'$_' } for $s =~ m{ [a-e]+ x }xmsg; print ''; ;; printf qq{'$_' } for $s =~ m{ [bd]+ (*SKIP)(*FAIL) | [a-e]+ x }xmsg; print ''; ;; printf qq{'$_' } for $s =~ m{ (?: [bd]+ (*SKIP)(*FAIL))? [a-e]+ x }xmsg; print ''; " 'aax' 'bbbx' 'ccccx' 'ddx' 'ex' 'aax' 'ccccx' 'ex' 'aax' 'ccccx' 'ex'
Re: Why is variable length lookahead implemented while lookbehind is not?
by moritz (Cardinal) on Sep 07, 2011 at 19:50 UTC

    When you write a backtracking regex engine like the one in Perl 5, a full (variable length) look-ahead is pretty easy to implement. You just have to a normal match, and if the match succeeds, you reset the match position to where the look-ahead started off. And one can mark the look-ahead as not needing to backtrack.

    On the other hand a look-behind requires either a backwards search, or some sufficiently advanced magic that resets the match position to the start of the string, and then matches something like (?s:.*)$look_behind\K, and anchors the end of that match to the position where the look-behind started, and then resets all the match variables as if that match has never happened, execept the captures from within $look_behind.

    And even when you do that, the performance is dependent on the length of the string up to the current position, instead of just depending on the string in the vicinity of the current match position; to do it properly, you'd really need to search backwards.

      When you write a backtracking regex engine like the one in Perl 5, a full (variable length) look-ahead is pretty easy to implement.

      It seems to me that the easy solution would produce bad performance. The easy solution requires matching each lookbehind subpattern at pos $current_pos - $max_match_length, which is 0 if /.*/ is used!

      'bbbbbb' =~ /(?<=[^z]*a)b/

      requires checking for "z" and "a" a total of 72 times each before failing even though the string is only 6 chars long!

      Oops, I guess you covered this at the end, but it seems to me that it's a foremost concern.

      It also forces

      'xiyjykz' =~ /(?<=x(.*)y(.*))z/s
      to produce
      $1 = 'iyj'; $2 = 'k'
      but one might expect
      $1 = 'i'; $2 = 'jyk'
        When you write a backtracking regex engine like the one in Perl 5, a full (variable length) look-ahead is pretty easy to implement.

        It seems to me that the easy solution would produce bad performance. The easy solution requires matching each lookbehind subpattern at pos $current_pos - $max_match_length

        (emphasis mine)

        It seems we're talking about two different things here. The simple approach for look-ahead is not slow (at least if you take care not to backtrack the look-ahead). And that's probably the reason it's implemented in Perl, and most advanced regex engines.

        In the case of look-behind, we're in violent agreement :-)

        My first thought for an "easy solution" is to generate a reversed version of the target string if there is a variable length look behind, then use the usual look ahead match processing on the reversed string starting at the "reversed" anchor point.

        True laziness is hard work
Re: Why is variable length lookahead implemented while lookbehind is not?
by ikegami (Patriarch) on Sep 07, 2011 at 19:46 UTC
    To save from writing a new regex engine to matches from right to left to use for lookbehinds, the existing engine simply starts matching the subpattern $match_length positions before the current one.
Re: Why is variable length lookahead implemented while lookbehind is not?
by halley (Prior) on Sep 07, 2011 at 19:40 UTC

    Having never looked at Perl's C implementation of this, I think it's just a matter of not writing support for it, because it is "touchy" code after so many minor tweaks, and supporting both directions may need yet more maintenance.

    Theoretically, any regular expression can be processed in reverse; that is, when searching for /def$/, you can look backwards through the string for the end of a line, then for an 'f' before it, then for an 'e' before that, and so on. This holds true no matter what the complexity of the pattern may be.

    In practice, there's a lot about a regex matcher engine that must be written with an expectation of the direction you're walking through the pattern and the source string. Parameterizing the direction or stride as +1/-1 is perhaps not trivial. So they punted: simple strings with an obvious anchor (current spot, end of line, etc.) can appear to be scanned backwards by calculating where they would start and scanning forwards.

    --
    [ e d @ h a l l e y . c c ]

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: perlquestion [id://924643]
Approved by toolic
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others exploiting the Monastery: (7)
As of 2024-04-24 09:00 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found