Beefy Boxes and Bandwidth Generously Provided by pair Networks
No such thing as a small change
 
PerlMonks  

How are we lazy?

by Ovid (Cardinal)
on Jul 19, 2000 at 00:32 UTC ( [id://23056]=perlquestion: print w/replies, xml ) Need Help??

Ovid has asked for the wisdom of the Perl Monks concerning the following question:

I've been concentrating many of my responses lately on regex-related questions as I wish to be at one with Perl's regex engine. To that end, I've been playing around with some benchmarks and have come across a result that I do not understand.

Background: Using dot plus a quantifier in a regex is a tricky business as it's not terribly precise. Worse, when iterated over, it's a performance killer. For example: /".+"/ will cause the .+ to grab as much stuff as it can, but then is forced to backtrack to that final quote, killing efficiency. Using something like /"[^"]+"/ keeps pulling in characters until we get to a quote. No backtracking is involved and it's usually more efficient as a result.

When matching against a string with more than one quote, the people often make it lazy by adding the question mark after the quantifier (/".+?"/). Since I cannot find a reference to the actual mechanics of making a quantifier lazy, I ran some benchmarks to figure out what was going on. I felt that lazy was probably less efficient than greedy due to the extra work the regex compiler needed to do to verify that it was making a minimal match. Most of my benchmarks bore that out, until now.

#!/usr/bin/perl -w use strict; use vars qw(@myvar $iteration); use Benchmark; @myvar = ('asdf"' . 'z' x 2000 . '"asdf', 'z' x 2000 . '"asdf"', '"asdf"' . 'z' x 2000, 'z' x 2000 . '"' . 'z' x 2000 . '"' . 'z' x 2000 ); for $iteration (0 .. (@myvar-1)) { print "Iteration: $iteration\n"; timethese(500000, { Greedy => '$myvar[$iteration] =~ /".+"/', Negated => '$myvar[$iteration] =~ /"[^"]+"/' }); }; Iteration: 0 Benchmark: timing 500000 iterations of Greedy, Negated... Greedy: 13 wallclock secs (13.84 usr + 0.00 sys = 13.84 CPU) Negated: 50 wallclock secs (49.99 usr + 0.00 sys = 49.99 CPU) Iteration: 1 Benchmark: timing 500000 iterations of Greedy, Negated... Greedy: 18 wallclock secs (17.68 usr + 0.00 sys = 17.68 CPU) Negated: 17 wallclock secs (17.02 usr + 0.00 sys = 17.02 CPU) Iteration: 2 Benchmark: timing 500000 iterations of Greedy, Negated... Greedy: 46 wallclock secs (46.75 usr + 0.00 sys = 46.75 CPU) Negated: 2 wallclock secs ( 1.92 usr + 0.00 sys = 1.92 CPU) Iteration: 3 Benchmark: timing 500000 iterations of Greedy, Negated... Greedy: 73 wallclock secs (73.60 usr + 0.00 sys = 73.60 CPU) Negated: 65 wallclock secs (64.97 usr + 0.00 sys = 64.97 CPU)
Now if we switch the first regex line to the following:
Lazy => '$myvar[$iteration] =~ /".+?"/',
We find that we get results similar to these:
Iteration: 0 Benchmark: timing 500000 iterations of Lazy, Negated... Lazy: 184 wallclock secs (183.51 usr + 0.00 sys = 183.51 CPU) Negated: 49 wallclock secs (49.59 usr + 0.00 sys = 49.59 CPU) Iteration: 1 Benchmark: timing 500000 iterations of Lazy, Negated... Lazy: 18 wallclock secs (17.96 usr + 0.00 sys = 17.96 CPU) Negated: 17 wallclock secs (17.03 usr + 0.00 sys = 17.03 CPU) Iteration: 2 Benchmark: timing 500000 iterations of Lazy, Negated... Lazy: 2 wallclock secs ( 2.26 usr + 0.00 sys = 2.26 CPU) Negated: 1 wallclock secs ( 1.92 usr + 0.00 sys = 1.92 CPU) Iteration: 3 Benchmark: timing 500000 iterations of Lazy, Negated... Lazy: 199 wallclock secs (198.66 usr + 0.00 sys = 198.66 CPU) Negated: 64 wallclock secs (64.59 usr + 0.00 sys = 64.59 CPU)
There's nothing too surprising there. Iteration 1 is about even. Iterations 0 and 3 clearly show that lazy is less efficient than greedy. But what the heck is going on with iteration 2? Lazy just beat the pants off of greedy. It went from being even or less efficient to all of a sudden running about 25 times faster. Clearly, there is something internal to the regex engine that's doing something here, but I can't figure it out. Help!

Replies are listed 'Best First'.
RE: How are we lazy?
by tye (Sage) on Jul 19, 2000 at 00:50 UTC

    In interation 2, Lazy only has to "forwardtrack" 4 times and then matches and doesn't even look at the last 2000 characters of the string.

    I'm disappointed that Lazy doesn't seem to be optimized very well, with the simplest of "forwardtracking" to find the closing " going much, much slower than matching [^"]+"

      Thank you! I don't know why I didn't think of forward tracking. Looking at all of the other results in light of this, they make perfect sense (even though I thought they were a little odd before). Tomorrow, you will get a ++ vote from me :)

      Actually, I think lazy has to be slower than a negated character class. Basically, either greedy or lazy quantifiers using the dot metacharacter are forced to back- or forwardtrack. That nasty little dot will match with anything that it can get its grubby little hands on and will cause the regex to "overshoot" which greatly increases the amount of necessary backtracking or forwardtracking.

      With the character class, Perl creates what amounts to a seive. Characters which match (or don't match, in the case of a negated character class) pass through the seive. When we get a non-allowed character, it stops, period. There is no "tracking" which occurs afterwards to clean up.

Re: How are we lazy?
by tye (Sage) on Jul 19, 2000 at 04:03 UTC

    Quick conjecture time (like usual)...

    The greedy operators are optimized. They figure out what characters could occur directly after their match. So in the case of /^.*"/, the .* part knows that it will end right before a " and so does the equivalent of a rindex() to find the last " in the string. Then it lets the rest of the regex try to match. If that fails, then it backtracks to the previous ".

    I always assumed that the non-greedy operators were optimized the same way. But based on your benchmarks, I've changed my mind. It looks like, in the case of /^.*?"/, that the .*? part doesn't compute what it might be followed by and just starts out matching nothing and letting the rest of the regex try to match. If this doesn't work, it forwardtracks to the next character. It should probably instead do the equivalent of index() at first and then forwardtrack to the next ", in this case.

    Sorry, I don't have time to dig into the regex engine code right now. This would probably be an "easy" patch except for the fact that even easy patches to the regex engine require 12th-level dieties.

      The greedy operators are optimized.

      If they are optimized, this is new in 5.6. What happens with something like /^.*"/ is the dot star will match everything up to the end of the string. Then, when the regex "realizes" that it has something to match after the dot star, it starts backtracking. This is involves going to the previous character and checking to see if it matches and repeating this until a match occurs (it's more complicated than that, but that's the gist of it). With a complicated string and regex, this backtracking gets quite hairy. That's why it's best to avoid using the dot metacharacter when appropriate.

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: perlquestion [id://23056]
Approved by root
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others studying the Monastery: (4)
As of 2024-04-16 14:33 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found