Re: Why is variable length lookahead implemented while lookbehind is not?
by davido (Cardinal) on Sep 07, 2011 at 19:36 UTC
|
The regex engine won't backtrack to validate a variable width assertion, and that's what would be necessary to make variable width work, from what I understand.
perlre suggests a workaround with \K, and I've seen people suggest that one of the most effective variable width lookbehinds is reverse (process the reversed string instead).
| [reply] |
|
| [reply] |
Re: Why is variable length lookahead implemented while lookbehind is not?
by Jim (Curate) on Sep 07, 2011 at 19:42 UTC
|
In Important Notes About Lookbehind, Jan Goyvaerts explains it this way:
The bad news is that most regex flavors do not allow you to use just any regex inside a lookbehind, because they cannot apply a regular expression backwards. Therefore, the regular expression engine needs to be able to figure out how many steps to step back before checking the lookbehind.
In modern versions of Perl, variable-length look-behind is effectively supported using the \K pattern ("K" for "keep"). See perlre.
#!perl
use strict;
use warnings;
use feature qw( say );
my $string = 'aaaabcccc';
$string =~ s/a+\Kb(?=c+)/B/;
say $string;
__END__
aaaaBcccc
| [reply] [d/l] [select] |
|
Correct me if I'm wrong, but doesn't the \K special escape emulate only the positive look-behind assertion? This is certainly suggested by perlre, in which the use of \K for variable-length look-behind is discussed only in the context of the (?<=pattern) assertion.
Update: However, the (*SKIP)(*FAIL) pair of Special Backtracking Control Verbs (5.10+) can be pressed into service for negative look-behind:
>perl -wMstrict -le
"my $s = 'aaxbbbxccccxddxex';
;;
printf qq{'$_' } for $s =~ m{ [a-e]+ x }xmsg; print '';
;;
printf qq{'$_' }
for $s =~ m{ [bd]+ (*SKIP)(*FAIL) | [a-e]+ x }xmsg; print '';
;;
printf qq{'$_' }
for $s =~ m{ (?: [bd]+ (*SKIP)(*FAIL))? [a-e]+ x }xmsg; print '';
"
'aax' 'bbbx' 'ccccx' 'ddx' 'ex'
'aax' 'ccccx' 'ex'
'aax' 'ccccx' 'ex'
| [reply] [d/l] [select] |
Re: Why is variable length lookahead implemented while lookbehind is not?
by moritz (Cardinal) on Sep 07, 2011 at 19:50 UTC
|
When you write a backtracking regex engine like the one in Perl 5, a full (variable length) look-ahead is pretty easy to implement. You just have to a normal match, and if the match succeeds, you reset the match position to where the look-ahead started off. And one can mark the look-ahead as not needing to backtrack.
On the other hand a look-behind requires either a backwards search, or some sufficiently advanced magic that resets the match position to the start of the string, and then matches something like (?s:.*)$look_behind\K, and anchors the end of that match to the position where the look-behind started, and then resets all the match variables as if that match has never happened, execept the captures from within $look_behind.
And even when you do that, the performance is dependent on the length of the string up to the current position, instead of just depending on the string in the vicinity of the current match position; to do it properly, you'd really need to search backwards.
| [reply] [d/l] [select] |
|
'bbbbbb' =~ /(?<=[^z]*a)b/
requires checking for "z" and "a" a total of 72 times each before failing even though the string is only 6 chars long!
Oops, I guess you covered this at the end, but it seems to me that it's a foremost concern.
It also forces
'xiyjykz' =~ /(?<=x(.*)y(.*))z/s
to produce
$1 = 'iyj'; $2 = 'k'
but one might expect
$1 = 'i'; $2 = 'jyk'
| [reply] [d/l] [select] |
|
When you write a backtracking regex engine like the one in Perl 5, a full (variable length) look-ahead is pretty easy to implement.
It seems to me that the easy solution would produce bad performance. The easy solution requires matching each lookbehind subpattern at pos $current_pos - $max_match_length
(emphasis mine)
It seems we're talking about two different things here. The simple approach for look-ahead is not slow (at least if you take care not to backtrack the look-ahead). And that's probably the reason it's implemented in Perl, and most advanced regex engines.
In the case of look-behind, we're in violent agreement :-)
| [reply] |
|
| [reply] |
|
|
Re: Why is variable length lookahead implemented while lookbehind is not?
by ikegami (Patriarch) on Sep 07, 2011 at 19:46 UTC
|
To save from writing a new regex engine to matches from right to left to use for lookbehinds, the existing engine simply starts matching the subpattern $match_length positions before the current one. | [reply] |
Re: Why is variable length lookahead implemented while lookbehind is not?
by halley (Prior) on Sep 07, 2011 at 19:40 UTC
|
Having never looked at Perl's C implementation of this, I think it's just a matter of not writing support for it, because it is "touchy" code after so many minor tweaks, and supporting both directions may need yet more maintenance.
Theoretically, any regular expression can be processed in reverse; that is, when searching for /def$/, you can look backwards through the string for the end of a line, then for an 'f' before it, then for an 'e' before that, and so on. This holds true no matter what the complexity of the pattern may be.
In practice, there's a lot about a regex matcher engine that must be written with an expectation of the direction you're walking through the pattern and the source string. Parameterizing the direction or stride as +1/-1 is perhaps not trivial. So they punted: simple strings with an obvious anchor (current spot, end of line, etc.) can appear to be scanned backwards by calculating where they would start and scanning forwards.
-- [ e d @ h a l l e y . c c ]
| [reply] [d/l] [select] |