Beefy Boxes and Bandwidth Generously Provided by pair Networks
No such thing as a small change

(Ab)using the Regex Engine

by jo37 (Friar)
on May 25, 2020 at 09:19 UTC ( #11117224=perlmeditation: print w/replies, xml ) Need Help??

The TASK #2 of perl-weekly-challenge-061 was to split a given string into certain subparts. There were two solutions that (ab)use Perl's regular expression engine to get all matches for the leading part of a regular expression. Though being one of the authors, I'm not so sure about this approach. How smart is the engine allowed to be? Is there a way to guarantee that it actually tries all possibilities?

The section Embedded Code Execution Frequency in perlre says:

How non-accepting pathways and match failures affect the number of times a pattern is executed is specifically unspecified and may vary depending on what optimizations can be applied to the pattern and is likely to change from version to version.
This is a rather clear statement, that the proposed solutions may fail in future versions of Perl. But does this hold in any case? See examples in this program:

#!/usr/bin/perl use strict; use warnings; my $match = qr[([ab]+)([ab]+)]; my $str = 'aba'; $str =~ /^ $match $ (?{ print "1: $1-$2\n" }) [c] /x; $str =~ /^ $match $ (?{ print "2: $1-$2\n" }) (?!) /x; $str =~ /^ $match $ (??{ print "3: $1-$2\n"; qr[(?!)] }) /x; __DATA__ 2: ab-a 2: a-ba 3: ab-a 3: a-ba

Explanations to the numbered samples:

  1. There is a non-matched character class [c] at the end of the pattern. In my copy of the "Camel Book" (3rd Edition, 2000) it is stated that the engine is smart enough to optimize away the match attempt if there is a single character, but not if it is inside a character class. The engine has become smarter since then: the (?{CODE}) block is not executed.
  2. Currently, using a negative look-ahead assertion as a non-matcher outsmarts the engine into trying to match the string. I reckon that the matching attempt might be optimized away in future versions.
  3. With a small change, the resulting pattern remains the same but isn't known to the regex engine from the beginning, as the final part now is the returned value from a (??{CODE}) block. To decide if there is a match, the CODE has to be executed and thus cannot be optimized away. Would sniffing at the CODE and detecting that it always returns something non-matchable be "legal"? I feel kind of safe with this but I may be wrong.
Would you agree with this statement, that seems to be in contrast to the quotation above?
A (??{CODE}) block is guaranteed to be executed, if the failing or success of a pathway containing this block solely depends on the returned subexpression.
Could we even have a zero-width assertion like (?!?{CODE}) that always fails but must not be optimized away in the sense of the previous proposition?

I'd be glad to see your opinions.

BTW: What matches and what is matched? Is a regex matching a string or is a string matching a regex? I don't know.



Replies are listed 'Best First'.
Re: (Ab)using the Regex Engine
by tybalt89 (Prior) on May 25, 2020 at 22:00 UTC
    #!/usr/bin/perl use strict; # use warnings; my $match = qr[([ab]+)([ab]+)]; my $str = 'aba'; $str =~ /^ $match $ (?{ print "1: $1-$2\n" }) [c] /x; $str =~ /^ $match $ (?{ print "2: $1-$2\n" }) (?!) /x; $str =~ /^ $match $ (??{ print "3: $1-$2\n"; qr[(?!)] }) /x; # The fact that this is a defined pattern probably means # there is a lesser chance of it being optimized away. $str =~ /^ $match $ (?{ print "4: $1-$2\n" }) (*FAIL) /x;

      A very good point that is ideed the ultimate answer to my question by guiding to Special Backtracking Control Verbs.

      It states that (*FAIL) can be used to force the engine into backtracking and that this is equivalent to (?!). So version 2 and yours are basically the same and both are guaranteed to work. The trickery from version 3 is not needed.

      So in the end it is "use" and not "abuse".



        I'd call it "abuse". My bet is this pattern of application is well-known and tolerated for the sake of critical mass of existing "cool examples of (ab)using re-engine", and therefore safe to use in the future :). Stand-alone (*F) is guaranteed to fail, there's no need to "force to backtrack" while staying in the same branch; and as there are no other branches in your example, the whole matching must have been optimized away. On the other hand, something like (?(?{CODE})(*F)), with CODE result depending on sub-matches so far, is legitimate use and another matter entirely, but not the case here.

        The impression is, aforementioned tolerance goes as far as injection of (*F) makes (but not always) engine fail to fail early, which is funny.

        my $match = qr[([ab]+)([ab]+)]; my $str = 'aba'; $str =~ /^ $match $ (?{ print "1: $1-$2\n" }) a /x; $str =~ /^ $match $ (?{ print "2: $1-$2\n" }) b /x; $str =~ /^ $match $ (?{ print "3: $1-$2\n" }) (*F) b /x; $str =~ /^ $match $ (?{ print "4: $1-$2\n" }) (*F) .. /x; __END__ 1: ab-a 1: a-ba 3: ab-a 3: a-ba
A reply falls below the community's threshold of quality. You may see it by logging in.

Log In?

What's my password?
Create A New User
Node Status?
node history
Node Type: perlmeditation [id://11117224]
Approved by hippo
Front-paged by haukex
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others chanting in the Monastery: (3)
As of 2020-11-25 03:00 GMT
Find Nodes?
    Voting Booth?

    No recent polls found