Beefy Boxes and Bandwidth Generously Provided by pair Networks
Clear questions and runnable code
get the best and fastest answer
 
PerlMonks  

What regex can match arbitrary-length quoted string?

by jimav (Sexton)
on Sep 28, 2017 at 20:47 UTC ( [id://1200316]=perlquestion: print w/replies, xml ) Need Help??

jimav has asked for the wisdom of the Perl Monks concerning the following question:

I want to use a regex to match double-quoted strings of arbitrary length. But the regex recommended in perlre(1) seems to use recursion even though greedy *+ is used, which that man page seems to say should prevent backtracking.

Any expression of the form  ( A | B )*+ seems to use recursion.
Can someone explain why recursion, rather than iteration, is used by the regex engine for this?

Here is a demo script showing the problem:

#!/usr/bin/perl use strict; use warnings; $_ = '"' . ('\"' x 100000) . '"'; print "NOT MATCHED!\n" unless /^ " (?: [^"\\]++ | \\. )*+ " /x ; # as recommended by perlre(1) # why does this use recursion (rather than iteration)? #Output under perl v5.24.1: # Complex regular subexpression recursion limit (32766) exceeded at li +ne 6. # NOT MATCHED!

Replies are listed 'Best First'.
Re: What regex can match arbitrary-length quoted string?
by dave_the_m (Monsignor) on Sep 28, 2017 at 22:47 UTC
    The regex engine doesn't recurse these days, but the error message hasn't changed. It does still have to store a backtracking state for each iteration, and there's still a hard limit of 32768-ish backtrack states hard-coded in for this op type, which aught to be changed at some point.

    Dave.

Re: What regex can match arbitrary-length quoted string?
by AnomalousMonk (Archbishop) on Sep 29, 2017 at 02:51 UTC

    Here's an example based on the pre-5.10  (?>pattern) atomic grouping. It should be fairly easily convertable to use the  + possessive quantifier modifier. (I'm using  'x' in place of  '"' because Windose command line and my REPL don't like double-quotes, but you get the idea.)

    c:\@Work\Perl\monks>perl -wMstrict -le "print 'perl version: ', $]; print '------------'; ;; my $non_x_or_escaped = qr{ (?> [^x\\]{0,32766}) (?> (?> \\.){0,32766} (?> [^x\\]{0,32766})) }xms; ;; my $s = 'zx' . ('\x' x 1_000_000) . 'xz'; print 'length of test string: ', length $s; print substr($s, 0, 20), ' ... ', substr($s, -20); ;; print 'MATCHED' if $s =~ m{ (x (?> $non_x_or_escaped)* x) }xms; my $captured = $1; print 'length captured: ', length $captured; print substr($captured, 0, 20), ' ... ', substr($captured, -20); print '------------'; ;; $s = 'zx' . ('p' x 1_000_000) . '\x' . ('q' x 1_000_000) . 'xz'; print 'length of test string: ', length $s; print substr($s, 0, 20), ' ... ', substr($s, -20); ;; print 'MATCHED' if $s =~ m{ (x (?> $non_x_or_escaped)* x) }xms; $captured = $1; print 'length captured: ', length $captured; print substr($captured, 0, 20), ' ... ', substr($captured, -20); " perl version: 5.008009 ------------ length of test string: 2000004 zx\x\x\x\x\x\x\x\x\x ... \x\x\x\x\x\x\x\x\xxz MATCHED length captured: 2000002 x\x\x\x\x\x\x\x\x\x\ ... x\x\x\x\x\x\x\x\x\xx ------------ length of test string: 2000006 zxpppppppppppppppppp ... qqqqqqqqqqqqqqqqqqxz MATCHED length captured: 2000004 xppppppppppppppppppp ... qqqqqqqqqqqqqqqqqqqx


    Give a man a fish:  <%-{-{-{-<

Re: What regex can match arbitrary-length quoted string?
by Anonymous Monk on Sep 28, 2017 at 22:48 UTC

    Perl regex engine tries to avoid (endless) backtracking and/or recursion as much as possible. It also has implementation limits or kludges (depending on how you see it). The repeat count(er) seems to be a 16-bit value. Indeed, if you specify a {1,77777} qualifier, you'd see an error

    Quantifier in {,} bigger than 32766 in regex; ...

    The warning message ("recursion limit exceeded") itself is misleading. There is no recursion involved in your regex, only iteration. The message is the same in either case.

    To work around the limitation, you could double the repeat qualifiers,

    (?: (?: ... )+ )*+
    or similar. Explicit bound {1,32000} is probably better as it gets rid of the warning, as well.

Re: What regex can match arbitrary-length quoted string?
by QM (Parson) on Sep 29, 2017 at 09:48 UTC
    It seems the \\., which captures the \" in the target, is not possessive, and is causing the error.

    The following works in the given case. Are there cases where it fails?

    print "NOT MATCHED!\n" unless /^ " (?: [^"\\]++ | (?: \\. )++ )*+ " /x ;

    Update:

    Playing around more, it seems that the possessives are not needed on the internals, but only that the \\. should have a quantifier:

    print "NOT MATCHED!\n" unless /^ " (?: [^"\\]+ | (?: \\. )+ )*+ " /x ;
    This works on a string of length 100 million, in <2s on my machine. (I didn't try any longer strings.)

    -QM
    --
    Quantum Mechanics: The dreams stuff is made of

      Assuming this is correct...

      Is this a simple omission or transcription error in perlre?

      Besides the entry in perlre, where else is the original solution given?

      Should it be updated or annotated?

      -QM
      --
      Quantum Mechanics: The dreams stuff is made of

      Your update doesn't solve the problem.
      $_ = '"' . ('x\"' x 100000) . '"'; print "NOT MATCHED!\n" unless /^ " (?: [^"\\]+ | (?: \\. )+ )*+ " /x; __END__ Complex regular subexpression recursion limit (32766) exceeded at foo +line 4. NOT MATCHED!
        Hmmm...I get "NOT MATCHED!", but no error message. But there are other inconsistencies.

        Putting several into a single script:

        #!/usr/bin/perl use strict; use warnings; our $quotes = '"' . ('\"' x 10000000) . '"'; print "Length of string is ", length($quotes), "\n"; our @qr; push @qr, qr/^ " (?: [^"\\]++ | \\. )*+ " /x ; push @qr, qr/^ " (?: [^"\\]++ | (?: \\. )++ )*+ " /x ; push @qr, qr/^ " (?: [^"\\]+ | (?: \\. )+ )*+ " /x ; for my $i (0..$#qr) { print "$i) $qr[$i] ==> "; print "NOT " unless $quotes =~ $qr[$i]; print "MATCHED!\n"; } __END__ Length of string is 20000002 Complex regular subexpression recursion limit (32766) exceeded at ./pm +1200316.pl line 16. 0) (?^x:^ " (?: [^"\\]++ | \\. )*+ " ) ==> NOT MATCHED! 1) (?^x:^ " (?: [^"\\]++ | (?: \\. )++ )*+ " ) ==> MATCHED! 2) (?^x:^ " (?: [^"\\]+ | (?: \\. )+ )*+ " ) ==> MATCHED!
        perl -v This is perl 5, version 22, subversion 1 (v5.22.1) built for x86_64-li +nux-gnu-thread-multi (with 58 registered patches, see perl -V for more detail) Copyright 1987-2015, Larry Wall

        -QM
        --
        Quantum Mechanics: The dreams stuff is made of

        Typo there: $_ = '"' . ('x\"' x 100000) . '"'; presumably should be $_ = '"' . ('\\"' x 100000) . '"';

        (Yes, it sometimes takes me a *long* time to read my email...)

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: perlquestion [id://1200316]
Approved by hippo
Front-paged by Corion
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others making s'mores by the fire in the courtyard of the Monastery: (3)
As of 2024-04-26 00:51 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found