TASK #2 of perl-weekly-challenge-061
was to split a given string into certain subparts.
There were two solutions that (ab)use Perl's regular expression engine to get all matches for the leading part of a regular expression.
Although I am one of the authors, I'm not so sure about this approach.
How smart is the engine allowed to be? Is there a way to guarantee that it actually tries all possibilities?
The section "Embedded Code Execution Frequency" in perlre says:
How non-accepting pathways and match failures affect the number of times a pattern is executed is specifically unspecified and may vary depending on what optimizations can be applied to the pattern and is likely to change from version to version.
This is a rather clear statement that the proposed solutions may fail in future versions of Perl.
But does this hold in every case?
See the examples in this program:
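#!/usr/bin/perl
use strict;
use warnings;

# Two greedy groups: 'aba' can be split as ab-a or a-ba.
my $match = qr[([ab]+)([ab]+)];
my $str = 'aba';

# 1: force failure with a character class that cannot match
$str =~ /^ $match $ (?{ print "1: $1-$2\n" }) [c] /x;
# 2: force failure with an empty negative look-ahead
$str =~ /^ $match $ (?{ print "2: $1-$2\n" }) (?!) /x;
# 3: force failure with a pattern returned at run time
$str =~ /^ $match $ (??{ print "3: $1-$2\n"; qr[(?!)] }) /x;

__DATA__
2: ab-a
2: a-ba
3: ab-a
3: a-ba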
Explanations of the numbered samples:
1. There is a character class [c] at the end of the pattern that cannot match.
In my copy of the "Camel Book" (3rd Edition, 2000) it is stated that
the engine is smart enough to optimize away the match attempt if there is a single character,
but not if it is inside a character class.
The engine has become smarter since then: the (?{CODE}) block is not executed.
2. Currently, using a negative look-ahead assertion as a non-matcher tricks the
engine into actually trying to match the string.
I reckon that the matching attempt might be optimized away in future versions.
3. With a small change, the resulting pattern remains the same but isn't known to the regex engine
from the beginning, as the final part is now the value returned from a (??{CODE}) block.
To decide whether there is a match, the CODE has to be executed and thus cannot be optimized
away.
Would sniffing at the CODE
and detecting that it always returns something non-matchable be "legal"?
I feel kind of safe with this, but I may be wrong.
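To make the third variant a bit more concrete, here is a minimal sketch that collects the splits in an array instead of printing them. The array name @splits is chosen for illustration only, and the result still depends on the engine executing the (??{CODE}) block on every pathway, which is exactly the behaviour in question.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $str = 'aba';
my @splits;    # hypothetical name, used only in this sketch

# The postponed subexpression records the current captures and then
# returns a pattern that can never match, so the engine backtracks
# and tries the next way of dividing the string.
$str =~ /^ ([ab]+) ([ab]+) $ (??{ push @splits, "$1-$2"; '(?!)' }) /x;

print "$_\n" for @splits;
```

The overall match always fails, so only the side effect on @splits is of interest.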
Would you agree with the following statement, which seems to contradict the quotation above?
A (??{CODE}) block is guaranteed to be executed if the failure or success of a pathway
containing this block depends solely on the returned subexpression.
Could we even have a zero-width assertion like (?!?{CODE}) that always fails but must not be
optimized away in the sense of the previous proposition?
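For comparison, the splits of the toy string can also be produced without relying on backtracking at all. This is only a sketch for the 'aba' example with two parts, not a general replacement for the regex approach; $cut, $head, $tail and @splits are names made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $str = 'aba';
my @splits;    # hypothetical name, used only in this sketch

# Try every cut position that leaves both parts non-empty and
# keep the ones where each part matches [ab]+ on its own.
for my $cut (1 .. length($str) - 1) {
    my $head = substr $str, 0, $cut;
    my $tail = substr $str, $cut;
    push @splits, "$head-$tail"
        if $head =~ /\A[ab]+\z/ && $tail =~ /\A[ab]+\z/;
}

print "$_\n" for @splits;
```

Here the order of the results follows the cut position (shortest first part first), not the engine's greedy backtracking order.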
I'd be glad to see your opinions.
BTW: What matches and what is matched?
Is a regex matching a string or is a string matching a regex?
I don't know.
Greetings, -jo
$gryYup$d0ylprbpriprrYpkJl2xyl~rzg??P~5lp2hyl0p$