Re^4: Help with Double Double Quotes regular expression (imprecise)

Well, you might think it shouldn't make much of a difference, but perl happens to disagree with you. I disagree with you too, but it's probably perl that you'll rather listen to. :)

You see, there's only one way it could match, but there are plenty of ways it can fail (as in "almost succeed"). And a regex engine has a tendency to try them all, it's apparently not yet smart enough to know it can't succeed. Looking at the regex

/(?:[^"]+|"")*/
[download]

and the string "around", there's plenty of ways this could possibly match:

(around)
(aroun)(d)
(arou)(nd)
(arou)(n)(d)
...
[download]

Each phrase between parens indicates a group as matched by the subpattern [^"]+, and all groups are matched as a whole by * — as you can see, there are many ways it can be split up.

If you use the "cut" operator, there's only one way this can match:

(around)
[download]

So it makes sense to expect that using the cut operator might yield a significant speed boost.

But, without further ado, here's the benchmark, as run with perl 5.8.8:

my $s = q("You spin me ""around"" and ""around"", ""round"", like a re
+cord, ""round"" and ""around"".);
use Benchmark 'cmpthese';
cmpthese(-3, { 
  cut => sub { return  $s =~ /"(?>[^"]+|"")*"(?!")/ },
  straight => sub { return  $s =~ /"(?:[^"]+|"")*"(?!")/ } 
});
[download]

Results (neither the figures, nor their ratios), are not always the same, so take with a grain of salt:

            Rate straight      cut
straight  7579/s       --     -89%
cut      69921/s     823%       --
[download]

That's a factor of 9, that /(?>[^"]+|"")*/ is faster than /(?:[^"]+|"")*/, for this string, and that is definitely not insignificant.

Comment on Re^4: Help with Double Double Quotes regular expression (imprecise) Select or Download Code

Replies are listed 'Best First'.
Re^5: Help with Double Double Quotes regular expression (imprecise) by tye (Sage) on Apr 02, 2007 at 22:38 UTC
Thanks. Excellent points. Back to `/A(B\|C)D/`, I have an ambiguity after matching B between matching a longer version of B or starting a new match against B. This ambiguity is harder to notice because it is a choice between B and B. So, we can remove that ambiguity explicitly via: `/"((?:[^"]+(?=")\|""))"(?!")/ # ^^^^^` [download] which is at least similar to `/"((?:(?>[^"]+)\|""))"(?!")/ # ^^^ ^` [download] but I still like being explicit over disabling backtracking, because I learn things (like what you taught me). Thanks again. I (now) recall discussions of this before where it was noted (by tilly) that Perl's regex engine was smarter than many in knowing how to avoid pathological performance in something at least similar to this situation. But checking "`use re 'debug';`" I see that the pathological case is indeed being triggered. Update:* I've been known to do the equivalent of: `if( /"((?:[^"]+\|"")*)("?)(?!")/ ) { # ^ ^^ die "Unclosed quote ($1)..." if ! $2;` [download] which I think also gets rid of the problem. - tye	[reply] [d/l] [select]
Re^5: Help with Double Double Quotes regular expression (imprecise) by BrowserUk (Patriarch) on Apr 02, 2007 at 22:28 UTC
Thankyou for this. I've long known about `(?>...)` and what it did. But I rarely use it because I always have to think long and hard about when and why I would want to use it. I think this example may just have made the penny drop for me. Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error. "Science is about questioning the status quo. Questioning authority". In the absence of evidence, opinion is indistinguishable from prejudice. "Too many [] have been sedated by an oppressive environment of political correctness and risk aversion."	[reply] [d/l]
Re^5: Help with Double Double Quotes regular expression (failing) by pKai (Priest) on Apr 02, 2007 at 22:23 UTC
I noticed that your sample `$s` is a non-matching one. Did you take that on purpose to emphasise on cutting off backtracking in the failing case? When I balance the (double)quotes in `$s` by adding one in last position (directly before the closing parentheses of `q`,) "straight" seems to have a slight edge over "cut" `Rate cut straight cut 110876/s -- -13% straight 127147/s 15% --` [download] I got similar figures as you for your original `$s`	[reply] [d/l] [select]
Re^6: Help with Double Double Quotes regular expression (failing) by bart (Canon) on Apr 02, 2007 at 22:28 UTC
I noticed that your sample $s is a non-matching one. Did you take that on purpose to emphasise on cutting off backtracking in the failing case? I did. Because it's a well known fact (see the owl book AKA Jeffrey Friedl's "Mastering Regular Expressions", for example) that the pathetic case typically rears its ugly head when matches fail. That was the case in my Javascript code, too.	[reply]


Pathologically Eclectic Rubbish Lister
	PerlMonks