Beefy Boxes and Bandwidth Generously Provided by pair Networks
Clear questions and runnable code
get the best and fastest answer
 
PerlMonks  

Trying to count the captures in a compiled regular expression

by BooK (Curate)
on May 03, 2004 at 01:01 UTC ( [id://349891]=perlquestion: print w/replies, xml ) Need Help??

BooK has asked for the wisdom of the Perl Monks concerning the following question:

I'm trying to find out how many captures a compiled regular expression will do. My first try was good enough for me, but hv pointed out several cases where it could break. While refining the code I came upon several other cases and wrote a big test suite.

The only case I can't handle yet is (?{ ... }) and it proves to be the most difficult. Because it can contain Perl code, there are many ways to defeat my little parsing skills. I'm looking for a way to safely ignore the content of (?{ ... }) and (??{ ... }) (consider how lucky I am that the regular expression engine is not reentrant...)

Here's the routine code:

sub captures { local $_ = shift; croak "$_ is not a compiled regexp" unless ref eq 'Regexp'; my @p; # remember the kind of opening parentheses we've seen my ($n, $x) = (0, 1); /\G\(\?[ism]*(x?)[^:]*:/gc # global /xism block && ( $x = $1 ? 1 : 0 ); while( /\G(?=.)/gcs ) { $x > 0 ? /\G[^[\\()#]+/gc # ignore standard stuff (/x) : /\G[^[\\()]+/gc; # ignore standard stuff /\G(?:\\.)+/gcs; # ignore backslashed stuff $x > 0 && /\G#.*/gc; # ignore comments under /x /\G\[\^?\]?[^]]*\]/gc; # character class /\G\(\?[ism]*(x?)[ism]*(?:-[ism]*(x?)[ism]*)?([:)])/gc && do { $x++, $3 eq ':' && push @p, '+x' if $1; # (?xism:...) and $x--, $3 eq ':' && push @p, '-x' if $2; # (?xism) blocks }; /\G\(\?\(\d+\)/gc && push @p, 'sp'; # conditional regexp /\G\(\?/gc && push (@p, 'sp') && next; # other special regexp /\G\(/gc && ($n++, push @p, 'cp'); # a capturing parenthese /\G\)/gc && do { # a closing parenthese $x-- if $p[-1] eq '+x'; # compute /x state $x++ if $p[-1] eq '-x'; pop @p; }; } $n; }

And here's the test suite (test 12 is a simple case where (?{ ... }) can break my code):

use Test::More; # test the captures() method my @regexps = ( [ qr/foo/, 0 ], [ qr/foo(.*)bar/, 1 ], [ qr/\(foo(bar(baz)*)/, 2 ], [ qr/((?=.)ldkj\(.*\)(?i:bar(.*))b)/, 2 ], [ qr/foo # (bar)/ims, 1 ], # 5 # thanks Hugo for suggesting these difficult cases [ qr{( x )}x, 1 ], [ qr{ (?x: # (comment) ) (?-x: # (capture) ) }, 1 ], [ qr{[()<>]}, 0 ], [ qr{([])<(>]+)}, 1 ], [ qr{[a # (comment) b]}, 0 ], # 10 # other difficult cases I've found on my own, # while browsing perlre and perlretut [ qr/(?x) # (comment) (?-x) # (capture) (?x) # (comment)/, 1 ], [ qr/foo(?{ print ( "foo" ) })bar/, 0 ], # the following are taken from perlre [ qr< (?{ $cnt = 0 }) # Initialize $cnt. ( a (?{ local $cnt = $cnt + 1; # Update $cnt, backtracking-sa +fe. }) )* aaaa (?{ $res = $cnt }) # On success copy to non-localized # location. >x, 1 ], [ qr{ \( (?: (?> [^()]+ ) # Non-parens without backtracking | (??{ $re }) # Group with matching parens )* \) }x, 0 ], [ qr{ \( ( [^()]+ # x+ | \( [^()]* \) )+ \) }x, 1 ], # 15 [ qr{ \( ( (?> [^()]+ ) # change x+ above to (?> x+ ) | \( [^()]* \) )+ \) }x, 1 ], [ qr{(?>#[ \t]*)}, 0 ], [ qr{#[ \t]*(?![ \t])}, 0 ], [ qr/ (?> \# [ \t]* ) ( .+ ) /x, 1 ], [ qr/ \# [ \t]* ( [^ \t] .* ) /x, 1 ], # 20 [ qr{ ( \( )? [^()]+ (?(1) \) ) }x, 1 ], # taken from perlretut [ qr/^(.+)(e|r)(.*)$/, 3 ], [ qr/^ [+-]? # first, match an optional sign ( # then match integers or f.p. mantissas: \d+\.\d+ # mantissa of the form a.b |\d+\. # mantissa of the form a. |\.\d+ # mantissa of the form .b |\d+ # integer of the form a ) ([eE][+-]?\d+)? # finally, optionally match an exponent $/x, 2 ], [ qr/^ [+-]?\ * # first, match an optional sign *and space* ( # then match integers or f.p. mantissas: \d+\.\d+ # mantissa of the form a.b |\d+\. # mantissa of the form a. |\.\d+ # mantissa of the form .b |\d+ # integer of the form a ) ([eE][+-]?\d+)? # finally, optionally match an exponent $/x, 2 ], [ qr/^ [+-]?\ * # first, match an optional sign ( # then match integers or f.p. mantissas: \d+ # start out with a ... ( \.\d* # mantissa of the form a.b or a. )? # ? takes care of integers of the form a |\.\d+ # mantissa of the form .b ) ([eE][+-]?\d+)? # finally, optionally match an exponent $/x, 3 ], # 25 [ qr/^[+-]?\ *(\d+(\.\d*)?|\.\d+)([eE][+-]?\d+)?$/, 3 ], [ qr/(?# Match an integer:)[+-]?\d+/, 0 ], [ qr/(?# Match an integer:)[+-]?\d+/, 0 ], [ qr/(?x)( # freeform version of an integer regexp [+-]? # match an optional sign \d+ # match a sequence of digits ) /x, 1 ], [ qr/([+-]?\ *(\d+(\.\d*)?|\.\d+)([eE][+-]?\d+)?)/, 4 ], # 30 [ qr/([+-]?\ *(?:\d+(?:\.\d*)?|\.\d+)(?:[eE][+-]?\d+)?)/, 1 ], [ qr/([+-]?\ *(?:\d+(?:\.\d*)?|\.\d+)(?:[eE]([+-]?\d+))?)/, 2 ], [ qr/\( ( [^()]+ | \([^()]*\) )+ \)/x, 1 ], [ qr/\( ( (?>[^()]+) | \([^()]*\) )+ \)/x, 1 ], [ qr/^(\w+)(\w+)?(?(2)\2\1|\1)$/, 2 ], # 35 [ qr/[ATGC]+(?(?<=AA)G|C)$/, 0 ], [ qr/(?{local $c = 0;}) # initialize count ( a # match 'a' (?{local $c = $c + 1;}) # increment count )* # do this any number of times, aa # but match 'aa' at the end (?{$count = $c;}) # copy local $c var into $count /x, 1 ], [ qr/(?(?{ $lang eq 'EN'; # is the language English? }) the | # if so, then match 'the' (die|das|der) # else, match 'die|das|der' ) /xi, 1 ], [ qr/^1 # match an initial '1' ( (??{'0' x $s0}) # match $s0 of '0' 1 # and then a '1' (?{ $largest = $s0; # largest seq so far $s2 = $s1 + $s0; # compute next term $s0 = $s1; # in Fibonacci sequence $s1 = $s2; }) )+ # repeat as needed $ # that is all there is /x, 1 ], ); plan tests => scalar @regexps; for ( @regexps ) { is( captures( $_->[0] ), $_->[1], "$_->[1] captures" ); }

Any help appreciated...

Replies are listed 'Best First'.
Re: Trying to count the captures in a compiled regular expression
by hv (Prior) on May 03, 2004 at 02:11 UTC

    With respect to (?xism-xism:...), beware of the docs for this (my emphasis):

    One or more embedded pattern-match modifiers, to be turned on (or turned off, if preceded by "-") for the remainder of the pattern or the remainder of the enclosing pattern group (if any).
    If I understand that correctly, the "only look at openings" approach will fail on something like:
    /(?x:((?-x:)) # (comment) )/
    or
    /((?-x:)) # (comment) /x

    The simplistic (?{...}) parsing I referred to is in toke.c:scan_const(); look for the test

    else if (s[2] == '{' /* This should match regcomp.c */ || ((s[2] == 'p' || s[2] == '?') && s[3] == '{'))
    which simply counts unescaped braces until the opening one is closed - something like:
    our $re_true = qr{(?=)}x; our $re_false = qr{(?!)}x; our $count; / # (?{ ... }) or (??{ ... }) or (legacy) (?p{ ... }) \G \( \? (?: \? \?? | p ) (?= \{ ) (?{ local $count = 0; }) (?: \{ (?{ local $count = $count + 1 }) | \} (?{ local $count = $count - 1 }) | \\ . | . )+? (??{ $count == 0 ? $re_true : $re_false }) /xgc;
    would be fitting, though I suspect there must be a simpler way.

    (consider how lucky I am that the regular expression engine is not reentrant...)

    Now now, no need for that sort of language.

    Hugo

      My opinion on (?{ ... }) was that I needed to know how Perl did it to make sure I'd do it right. I'm glad you answered: it proves I was not the only one wanting it to work the hard way. ;-)

      Thanks to Roy Johnson below, I guess I'll blend in a does of simplicity in my code. There's still a bug somewhere, though.

      Well, at least I can keep the test suite. *sigh*

Re: Trying to count the captures in a compiled regular expression
by Roy Johnson (Monsignor) on May 03, 2004 at 02:46 UTC
    How about just using it, marked optional, and seeing how many captures you get?
    $_ = 'anything'; my @capture_count; my $regex = qr/one(.*?)((two)(four))/; @capture_count = /(?:$regex)?/; print @capture_count." captures\n";

    The PerlMonk tr/// Advocate

      Ack. I guess spending two days trying to grok yacc and Parse::Yapp led me to choose to do it the hard way. Thanks for a healthy does of simplicity.

      Unfortunately, the following regexp break your very good idea:

      $regex = qr/foo/; # should return 0

      That only means your code needs at least one capture:

      sub captures { my $re = shift; return scalar ( @_ = '' =~ /(?:($re))?/ ) - 1; }

      This passes my whole test suite, expect for this regexp:

      $regex = qr/(?x) # (comment) (?-x) # (capture) (?x) # (comment)/;

      Which dies horribly with the message:

      Unmatched ( in regex; marked by <-- HERE in m/(?:(( <-- HERE ?-xism:(?x)  # (comment)
                  (?-x) # (capture) (?x) # (comment))))?/

      So, $regex compiles, but qr/(?:($regex))?/ doesn't? I'm lost.

        expect for this regexp...

        I guess the matching close paren gets commented out in the combined string, though if so I think that's probably a bug: I think we tried it fix it a while back so that a comment in an embedded qr// would not leak out to the enclosing pattern.

        Hugo

      Very nice. :)

      You can go a step better by making the outer match minimal, which means it will immediately match zero times and thus avoid the time and danger of trying to match the interior at all:

      @capture_count = /($regex)??/; print @capture_count - 1, " captures\n";

      Hugo

      Wow, that works! I knew there had to be a way to evaluate the regexp to get an answer. Good stuff! :)
        Security alert: that will run code in (??{ }) / (?p{ }). Maybe something like this instead?
        $regex = qr:(??{print "look ma, no rm -rf /\n"}):; $captures = (() = ""=~/(|$regex)/) - 1;
        Also, note that you have to add a () set and then subtract it from the count to be able to distinguish between 0 captures and 1 capture.

      Could you explain why/how that works?


      ___________
      Eric Hodges
        Upon a successful match, the match operator returns a list of the captures (or, in scalar context, the number of captures), one element for each set of capturing parentheses, even if the captured value is empty. By making the entire pattern optional, we ensure a successful match, and thus Perl will tell us how many groupings were.

        As was pointed out by ysth and hugo, the case of no groupings will not return zero (because we need a true value to indicate a successful match), so we ought to put capturing parentheses around the expression, and subtract one from the result. And the possibility of embedded code, which we wouldn't want to run, is another caveat, so we want to use minimal matching. Hence:

        my $regex = qr/foo/; $_ = 'anything'; my $matches = (() = /($regex)??/) - 1; # oops! fixed print "There were $matches groupings\n";

        The PerlMonk tr/// Advocate
Re: Trying to count the captures in a compiled regular expression
by exussum0 (Vicar) on May 03, 2004 at 02:05 UTC
Re: Trying to count the captures in a compiled regular expression
by japhy (Canon) on May 03, 2004 at 16:07 UTC
    In lieu of my unfinished Regexp::Parser, you could use Graham Barr's Regexp module, which allows you to say:
    use Regexp; my $rx = Regexp->new('pattern'); my $nparens = $rx->nparens;
    His module is not a parser like mine is, but rather, an XS interface to the C code behind the regex engine.
    _____________________________________________________
    Jeff[japhy]Pinyan: Perl, regex, and perl hacker, who'd like a job (NYC-area)
    s++=END;++y(;-P)}y js++=;shajsj<++y(p-q)}?print:??;

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: perlquestion [id://349891]
Approved by Zaxo
Front-paged by broquaint
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others studying the Monastery: (6)
As of 2024-04-18 22:17 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found