Beefy Boxes and Bandwidth Generously Provided by pair Networks
XP is just a number
 
PerlMonks  

Re^2: Help with Double Double Quotes regular expression (precise)

by bart (Canon)
on Apr 02, 2007 at 10:51 UTC ( [id://607798]=note: print w/replies, xml ) Need Help??


in reply to Re: Help with Double Double Quotes regular expression (precise)
in thread Help with Double Double Quotes regular expression

At least in Perl, I'd do this instead:
/"((?>[^"]+|"")*)"/
As soon as you can see nested quantifiers, it should raise a red flag: in the worst case, this can get very slow.

Only last week, I've had to deal with a similar nested quantifier regexp in Javascript, where processing of function with a regexp slowed down from 15ms, typical case, to 8 seconds in a bad case. That's a slowdown factor of 500. And it could even have been worse.

And unfortunately, Javascript doesn't know (?>pattern). Perl does. Use it when appropriate.

Replies are listed 'Best First'.
Re^3: Help with Double Double Quotes regular expression (imprecise)
by tye (Sage) on Apr 02, 2007 at 16:27 UTC

    Actually, part of the point of being precise in a regex is that it prevents pathological performance problems even in the face of nested quantifiers; even they can't get really slow. If the regex is "precise", then it can only match one way and it will never thrash trying a ton of different combinations, even if used as part of some larger regex. It also means that it won't surprise you by matching in some unexpected way.

    The regex can backtrack out of what it tried to match but each step will find no alternative way to move forward and so it will backtrack out very directly. The use of (?>...) will only slightly speed up such a case.

    How you verify that a regex is precise is you look at each decision point in the regex and verify that there is only one way to move forward from there. So if you have /A(B|C)*D/, after you've matched A your regex will try the following ways to move forward (in this order): 1) Try to match a longer version of A, 2) Match B, 3) Match C, 4) Match D.

    If I'd not been in a bit of hurry when I'd posted, I would have done this and noticed my mistake. So let's look at the well-trodden territory of matching quotes when \" is used to esacpe the quote instead of "".

    So in the case of /"((?:[^\\"]+|\\")*)"/, A is ", B is [^\\"]+, C is \\", and D is ". So A can't be matched in any longer way (since it has no quantifiers). C can only match if the next character is a backslash. D can only match if the next character is a double quote. Finally, B can only match if the next character is neither \ nor ", so that decision point is unambiguous. Note that the use of + is important here. /"((?:[^\\"]*|\\")*)"/ (note the + was replaced with *) is not precise.

    Some who have read Mastering Regular Expressions would advise you to avoid the nested quantifiers and instead use /"((?:[^\\"]|\\")*)"/. That version is also precise. The main problem with it is a Perl quirk that prevents a regex part from matching more than 32K times and so that regex will die if used to try to match a string of more than 32K characters inside quotes. I also guess that it would be slightly slower.

    Now, the problem with /"((?:[^"]+|"")*)"/ is that C ("") and D (") leads to an ambiguity. If I'd bothered to test my own test case, I would have also seen this. But coming up with sufficient test cases is quite a challenge so I find it works better to also check that my regex is precise. So I should have posted:

    /"((?:[^"]+|"")*)"(?!")/

    which is precise but in one case must look at the next two characters before knowing which route must be taken.

    I prefer to not use (?>...) because avoiding it "forces" me to ensure that my regex is precise. But (?>...) can be useful both for preventing pathological performance problems and for preventing some surprises (though I haven't seen it used enough to get a feel for how often it will lead to other types of surprises).

    - tye        

      Well, you might think it shouldn't make much of a difference, but perl happens to disagree with you. I disagree with you too, but it's probably perl that you'll rather listen to. :)

      You see, there's only one way it could match, but there are plenty of ways it can fail (as in "almost succeed"). And a regex engine has a tendency to try them all, it's apparently not yet smart enough to know it can't succeed. Looking at the regex

      /(?:[^"]+|"")*/
      and the string "around", there's plenty of ways this could possibly match:
      (around) (aroun)(d) (arou)(nd) (arou)(n)(d) ...
      Each phrase between parens indicates a group as matched by the subpattern [^"]+, and all groups are matched as a whole by * — as you can see, there are many ways it can be split up.

      If you use the "cut" operator, there's only one way this can match:

      (around)
      So it makes sense to expect that using the cut operator might yield a significant speed boost.

      But, without further ado, here's the benchmark, as run with perl 5.8.8:

      my $s = q("You spin me ""around"" and ""around"", ""round"", like a re +cord, ""round"" and ""around"".); use Benchmark 'cmpthese'; cmpthese(-3, { cut => sub { return $s =~ /"(?>[^"]+|"")*"(?!")/ }, straight => sub { return $s =~ /"(?:[^"]+|"")*"(?!")/ } });
      Results (neither the figures, nor their ratios), are not always the same, so take with a grain of salt:
      Rate straight cut straight 7579/s -- -89% cut 69921/s 823% --
      That's a factor of 9, that /(?>[^"]+|"")*/ is faster than /(?:[^"]+|"")*/, for this string, and that is definitely not insignificant.

        Thanks. Excellent points.

        Back to /A(B|C)*D/, I have an ambiguity after matching B between matching a longer version of B or starting a new match against B. This ambiguity is harder to notice because it is a choice between B and B.

        So, we can remove that ambiguity explicitly via:

        /"((?:[^"]+(?=")|"")*)"(?!")/ # ^^^^^

        which is at least similar to

        /"((?:(?>[^"]+)|"")*)"(?!")/ # ^^^ ^

        but I still like being explicit over disabling backtracking, because I learn things (like what you taught me). Thanks again.

        I (now) recall discussions of this before where it was noted (by tilly) that Perl's regex engine was smarter than many in knowing how to avoid pathological performance in something at least similar to this situation. But checking "use re 'debug';" I see that the pathological case is indeed being triggered.

        Update: I've been known to do the equivalent of:

        if( /"((?:[^"]+|"")*)("?)(?!")/ ) { # ^ ^^ die "Unclosed quote ($1)..." if ! $2;

        which I think also gets rid of the problem.

        - tye        

        Thankyou for this. I've long known about (?>...) and what it did. But I rarely use it because I always have to think long and hard about when and why I would want to use it. I think this example may just have made the penny drop for me.


        Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
        "Science is about questioning the status quo. Questioning authority".
        In the absence of evidence, opinion is indistinguishable from prejudice.

        I noticed that your sample $s is a non-matching one.

        Did you take that on purpose to emphasise on cutting off backtracking in the failing case?

        When I balance the (double)quotes in $s by adding one in last position (directly before the closing parentheses of q,) "straight" seems to have a slight edge over "cut"

        Rate cut straight cut 110876/s -- -13% straight 127147/s 15% --
        I got similar figures as you for your original $s

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: note [id://607798]
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others wandering the Monastery: (5)
As of 2024-04-19 12:14 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found