comment on

The TASK #2 of perl-weekly-challenge-061 was to split a given string into certain subparts. There were two solutions that (ab)use Perl's regular expression engine to get all matches for the leading part of a regular expression. Though being one of the authors, I'm not so sure about this approach. How smart is the engine allowed to be? Is there a way to guarantee that it actually tries all possibilities?

The section Embedded Code Execution Frequency in perlre says:

How non-accepting pathways and match failures affect the number of times a pattern is executed is specifically unspecified and may vary depending on what optimizations can be applied to the pattern and is likely to change from version to version.

This is a rather clear statement, that the proposed solutions may fail in future versions of Perl. But does this hold in any case? See examples in this program:

#!/usr/bin/perl

use strict;
use warnings;

my $match = qr[([ab]+)([ab]+)];
my $str = 'aba';

$str =~ /^ $match $ (?{  print "1: $1-$2\n" }) [c] /x;
$str =~ /^ $match $ (?{  print "2: $1-$2\n" }) (?!) /x;
$str =~ /^ $match $ (??{ print "3: $1-$2\n"; qr[(?!)] }) /x;

__DATA__
2: ab-a
2: a-ba
3: ab-a
3: a-ba
[download]

Explanations to the numbered samples:

There is a non-matched character class [c] at the end of the pattern. In my copy of the "Camel Book" (3rd Edition, 2000) it is stated that the engine is smart enough to optimize away the match attempt if there is a single character, but not if it is inside a character class. The engine has become smarter since then: the (?{CODE}) block is not executed.
Currently, using a negative look-ahead assertion as a non-matcher outsmarts the engine into trying to match the string. I reckon that the matching attempt might be optimized away in future versions.
With a small change, the resulting pattern remains the same but isn't known to the regex engine from the beginning, as the final part now is the returned value from a (??{CODE}) block. To decide if there is a match, the CODE has to be executed and thus cannot be optimized away. Would sniffing at the CODE and detecting that it always returns something non-matchable be "legal"? I feel kind of safe with this but I may be wrong.

Would you agree with this statement, that seems to be in contrast to the quotation above?

A (??{CODE}) block is guaranteed to be executed, if the failing or success of a pathway containing this block solely depends on the returned subexpression.

Could we even have a zero-width assertion like (?!?{CODE}) that always fails but must not be optimized away in the sense of the previous proposition?

I'd be glad to see your opinions.

BTW: What matches and what is matched? Is a regex matching a string or is a string matching a regex? I don't know.

Greetings,
-jo

$gryYup$d0ylprbpriprrYpkJl2xyl~rzg??P~5lp2hyl0p$

In reply to (Ab)using the Regex Engine by jo37

Are you posting in the right place? Check out Where do I post X? to know for sure.
Posts may use any of the Perl Monks Approved HTML tags. Currently these include the following:
<code> <a> <b> <big> <blockquote> <br /> <dd> <dl> <dt> <em> <font> <h1> <h2> <h3> <h4> <h5> <h6> <hr /> <i> <li> <nbsp> <ol> <p> <small> <strike> <strong> <sub> <sup> <table> <td> <th> <tr> <tt> <u> <ul>
Snippets of code should be wrapped in <code> tags not <pre> tags. In fact, <pre> tags should generally be avoided. If they must be used, extreme care should be taken to ensure that their contents do not have long lines (<70 chars), in order to prevent horizontal scrolling (and possible janitor intervention).
Want more info? How to link or How to display code and escape characters are good places to start.


laziness, impatience, and hubris
	PerlMonks