
The recent discussions about whether or not to use /o and when to use qr// have struck a nerve with me. In fact, I've become so irked with all the poorly justified conclusions that I examined the source in question, traced it with gdb and found out exactly what happens. After reading this document you should understand what happens when you use interpolation, qr//, literals and some other odd things as regular expressions. The purpose of this is so that all the silly "debates" about whether one use of qr// is faster than /o can be put to rest and we can get onto other, sillier and more important things.

  1. Regexp Matching
  2. Regexp guts
  3. Examples
  4. Benchmark

Regexp matching

I'm told that /o appeared in perl 4 as a solution so that regular expressions would not need to be repeatedly recompiled when they used variable interpolation. With the advent of perl 5 the qr// construct wholly replaces the /o flag. In conversation with BrowserUK it occurred to me that /o can be considered a sort of "delayed constant expression". So instead of just writing / ... / in your source you could write $some_regex = ' ... '; /$some_regex/o and whatever $some_regex was equal to at first use would become the constant expression. This gives you the capability to define your expression elsewhere and re-use it in a matching operation. The key idea here is that because it is a constant, once defined it can never be changed. Ever. (That is, until perl exits and you start a new script.)

    $BIG_REGEX = " ... ";
    sub do_something {
        # The expression is "fixed" into place on first use.
        # !! It's a constant.
        if ( $sexpert =~ m/$BIG_REGEX/o ) { ... }
    }
    do_something(); # the expression is compiled and fixed into place
    do_something(); # the compiled expression is directly re-used

It's worth noting that in perl 5 the expression engine already optimizes by checking to see if the regex has changed from one invocation to the next. So if I wrote $some_regex = ' ... '; do_something while /$some_regex/; the expression would only be compiled once. The engine compares the new string with the source of the previously compiled regex, in effect an `eq' test. So even without /o and qr//, perl still avoids recompiling when it isn't necessary. You do get the overhead of the string-equality test. It's a small but non-zero price.

    $HUGE_REGEX = " ... ";
    sub do_it {
        # The expression is recompiled whenever $HUGE_REGEX changes
        if ( $sexpert =~ m/$HUGE_REGEX/ ) { ... }
    }
    do_it(); # the expression is compiled
    do_it(); # the string didn't change so the compiled regex is re-used
    $HUGE_REGEX = " !!! ";
    do_it(); # the string changed so the regex is recompiled.
    do_it(); # the new regex re-used.
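The difference between the two forms above is easy to observe from the outside. This is a minimal sketch, not taken from the original post: the names $pattern, match_frozen and match_live are made up for illustration, and it only shows which pattern ends up being used, not how many compilations happen.

```perl
use strict;
use warnings;

my $pattern = 'foo';

# With /o the first pattern seen is wired in for the life of the program.
sub match_frozen { my ($text) = @_; return $text =~ m/$pattern/o ? 1 : 0 }

# Without /o the current value of $pattern is consulted on every call.
sub match_live   { my ($text) = @_; return $text =~ m/$pattern/  ? 1 : 0 }

my @frozen = ( match_frozen('foo') );   # compiles /foo/ and fixes it in place
my @live   = ( match_live('foo') );

$pattern = 'bar';

push @frozen, match_frozen('bar');      # still matching against /foo/
push @live,   match_live('bar');        # the string changed, so /bar/ is used

print "frozen: @frozen\n";
print "live:   @live\n";
```

The frozen sub keeps answering for the pattern it first saw, which is exactly the "delayed constant" behaviour described earlier.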

The qr// operator (defined in perlop) allows you to have $HUGE_REGEX be precompiled during script compilation (instead of at runtime - this might be important for mod_perl users) and gives you potentially greater control over expression changes. By moving to the qr// operator I've allowed the expression to be changed, since the code will now match against whatever the current value is. In the /o example, if I changed $BIG_REGEX after using it once, the change would not be reflected in the match being tested. I declared the expression using my() so that it couldn't be altered by other code. The replacement code with $qr is alteration-friendly. In general I think that the qr// method is the Right Way to solve any problem that might also be handled by /o. It's just more flexible and the intent is easier to deal with. The alternative is code that is easier to break.

    # I need a better way to phrase this. Ideas?
    $BIGGER_REGEX = qr/ ... /; # The contents of qr// are a constant expression so are compiled at compile-time.
    sub do_it_again {
        if ( $sexpert =~ $BIGGER_REGEX ) { ... }
    }
    do_it_again(); # the pre-compiled regex is used directly.
    do_it_again(); # ditto

    # This qr// was compiled at compile-time. Store a copy in $BIGGER_REGEX.
    # No recompilation occurs now.
    $BIGGER_REGEX = qr/ !!! /;
    do_it_again(); # Like before, a pre-compiled regex is used directly. Nothing special happens.
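To make the "alteration-friendly" point concrete, here is a small sketch under my own assumptions (the variable $regex and the sub matches are hypothetical names, not from the post). Reassigning the variable to a different qr// object changes what the match tests, with no /o-style freezing.

```perl
use strict;
use warnings;

my $regex = qr/cat/;

sub matches { my ($text) = @_; return $text =~ $regex ? 1 : 0 }

my $before    = matches('cat');   # uses the compiled /cat/
$regex        = qr/dog/;          # swap in a different compiled pattern
my $after_dog = matches('dog');   # the match now follows the new object
my $after_cat = matches('cat');   # and no longer matches the old one

print "before=$before after_dog=$after_dog after_cat=$after_cat\n";
```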

It should be immediately apparent that using /o on a regex with *no* interpolation has zero effect. You could use it on a regex that uses interpolation but then you get delayed runtime effects. In general I'd much prefer all of my compilation to occur at compilation time. String evals are restricted as a matter of practice, and the same reasoning applies here. If you use qr// expressions then you've shunted the regex compilation and syntax checking off to the normal script compilation time. The alternative is that you discover a syntax error at runtime. Yuck.
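A rough way to see the difference in when a syntax error surfaces. This is only a sketch: the string eval is my stand-in for "compile this source now", and the names $broken and $reached are invented for the demonstration.

```perl
use strict;
use warnings;

# An interpolated pattern is only checked when the match actually runs.
my $broken = '(';    # unbalanced parenthesis
my $ran_ok = eval { my $hit = 'x' =~ /$broken/; 1 };
my $runtime_error = $ran_ok ? '' : $@;

# A literal qr// is checked when its enclosing code is compiled. The string
# eval fails while compiling its text, so the assignment to $reached that
# precedes the bad pattern never runs.
our $reached = 0;
my $compiled_ok = eval q{ $main::reached = 1; my $re = qr/(/; 1 };
my $compile_error = $compiled_ok ? '' : $@;

print "runtime error: $runtime_error";
print "compile error: $compile_error";
print "reached: $reached\n";
```

In a real script the second case would stop perl before the program starts at all, which is the behaviour I prefer.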

Regexp guts

And now for some of the inside dirt. When you use regular expressions you are typically using one or more of the internal op codes "match"/"subst", "regcomp", "regcreset" and "qr". match and subst are the bare-bones functions underlying m// and s///. The other op codes are in support of them. regcomp is what is responsible for compiling the regex as needed. regcreset indicates whether the expression was interpolated and qr returns a 'Regexp' blessed scalar with a compiled regex in its guts. The key here is to see what triggers when.

match / subst: Both of these opcodes expect to have been given a compiled regular expression. They execute it and do all the actual work of a =~ s/// or =~ m// match. The operation here is best described by referring back to the manual in perlre and perlop. I won't detail what the regexp engine actually does because I don't know that myself!

regcomp: This is the key operation for most of the controversy here. When given a compiled expression from a qr// object it returns that directly. When given a string and a previously compiled expression, it checks to see whether they are the same. If they are string-equal then it returns the compiled expression. If they are not equal then it invokes the regular expression compiler. This can happen at run-time, somewhat like string-eval. All the usual caveats on syntax errors apply here - you won't find out about them until the expression is actually run. Invoking the compiler is an expensive operation and shouldn't be done in time-critical sections of code.

Adding the /o flag does a special thing. Once regcomp gets a suitable compiled expression it wires that expression directly into the match / subst operation and then removes itself. So in essence - the perl program re-writes itself to remove the compilation stage on the fly.

regcreset: This is a dirt cheap operation. It resets the regular expression's notion of whether there was any interpolation at all. This is only used to prevent malicious use of (?{ }) in interpolated expressions. See use re 'eval' for more detail on this.

qr: This takes an already compiled regular expression, constructs a new blessed object and attaches the expression to the object with some magic. This is somewhat equivalent to saying bless \..., 'Regexp'. You'll note that if you didn't use interpolation inside the qr// expression, all the actual regex compilation already happened at script-compilation time. In fact, the rules for when and how a qr// expression is compiled are exactly the same as for an m// operation. The idea here is that if you give a qr// compiled expression to an m// or s/// operation then no runtime compilation is required at all.
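You can poke at the object that qr hands back without touching the internals. The sketch below assumes perl 5.10 or later, where the re module exports is_regexp and regexp_pattern; the variable names are mine.

```perl
use strict;
use warnings;
use re qw(is_regexp regexp_pattern);

my $qr = qr/[spectal]{9}/i;

my $class = ref $qr;                           # the package qr// blesses into
my $is_re = is_regexp($qr) ? 1 : 0;            # true for a real compiled regexp
my ( $source, $flags ) = regexp_pattern($qr);  # pattern text and its modifiers

print "class=$class is_regexp=$is_re source=$source flags=$flags\n";
```

Checking with is_regexp rather than comparing ref() against a class name is a design choice on my part: it still works if the object has been reblessed into some other package.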

Examples

Constant Expression 1

This shows a normal constant expression match. The only operation of interest is match.

    $data =~ m/[spectal]{9}/;
    # Match against $data
    # match(/"[spectal]{9}"/)

Simple Interpolation 1

This shows a plain string and then what happens when the string is used to match against $data. No special attempt was made at precompilation or anything else. Note that though I wrote the match differently for each case, the same program is generated. regcomp is going to be fully slow because it has to compile the given expression. Once compiled it stores a copy with that match operation. Unfortunately these are two separate match operations, so the second cannot reuse the compilation of the first.

    $re = "[spectal]{9}";

    $data =~ $re;
    # match()
    # regcomp() # Compile the expression
    # regcreset

    $data =~ m/$re/;
    # match()
    # regcomp() # Compile the expression
    # regcreset

Constant expression 2

The expression inside qr// was compiled during BEGIN and is merely used in the match. The key difference between this and the previous examples is that regcomp will immediately exit because it was given a qr// precompiled object. So regcomp will be a super fast operation in this case.

    $qr = qr/[spectal]{9}/; # This was compiled during BEGIN{} and has no runtime effect.

    $data =~ $qr;
    # match()
    # regcomp() # Re-use the expression
    # regcreset

    $data =~ m/$qr/;
    # match()
    # regcomp() # Re-use the expression
    # regcreset

Interpolation 2

This shows how qr// can compile an expression at runtime. The later uses of the precompiled qr// object conform to the same rules as described in Constant Expression 2.

    $re = '[spectal]{9}';
    $qr = qr/$re/; # Compile the expression
    # qr()
    # regcomp() # Compile the expression
    # regcreset

    $data =~ $qr;
    # match()
    # regcomp() # Re-use the expression
    # regcreset

    $data =~ m/$qr/;
    # match()
    # regcomp() # Re-use the expression
    # regcreset

Concatenated Interpolation

In this case I show that even when some portions of an expression were precompiled, those fragments are not reused; for recompilation purposes the entire expression is examined. So either the entire thing is precompiled or none of it is. The first example has only two fragments, both of which are precompiled, but this is no different from having no precompilation at all.

    $qr_a = qr/\w/; # Pre-compiled during BEGIN {}
    $qr_b = qr/\d/; # Pre-compiled during BEGIN {}

    # Stringify both $qr_a and $qr_b then compile new regex. The previous
    # pre-compilation is not used
    $data =~ m/$qr_a$qr_b/;
    # match()
    # regcomp() # Compile the expression
    # regcreset

    $re_a = "\\w";
    $re_b = "\\d";
    $data =~ m/$re_a$re_b/;
    # match()
    # regcomp() # Compile the expression
    # regcreset
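If you do need to build a pattern from precompiled pieces, one way to follow the reasoning above is to pay for the combined compilation once and keep the result in its own qr// object. A sketch with made-up names ($qr_word, $qr_digit, $qr_both):

```perl
use strict;
use warnings;

my $qr_word  = qr/\w/;
my $qr_digit = qr/\d/;

# Interpolating two objects builds a new pattern string which then has to be
# compiled. Doing that once and keeping the result means later matches are
# handed a single precompiled expression.
my $qr_both = qr/$qr_word$qr_digit/;

my @hits = grep { $_ =~ $qr_both } qw( a1 ab 7 _9 );

print "@hits\n";
```

Matching against $qr_both directly then behaves like the Constant Expression 2 case rather than the concatenated one.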

Postponed Constant Expressions 1

This example is like Simple Interpolation 1 except that with the /o flag the regcomp() operation is removed from the program after execution. I think regcreset remains.

    $re = '[spectal]{9}';
    $data =~ m/$re/o;
    # match()
    # regcomp() # Compile the expression and remove this step
    # regcreset

Simple interpolation -> Literal 2

This example mirrors Interpolation 2 except that the regcomp operation is removed from the qr// line.

    $re = '[spectal]{9}';
    $qr = qr/$re/o;
    # qr()
    # regcomp() # Compile the expression and remove this step
    # regcreset

    $data =~ m/$qr/;
    # match()
    # regcomp() # Re-use the expression
    # regcreset

Simple interpolation -> Literal 3

Stepping off from where the preceding example left off, this goes one step further and removes the already really fast regcomp step from the match operation.

    $qr = ...
    $data =~ m/$qr/o;
    # match()
    # regcomp() # Associate the precompiled expression and remove this step
    # regcreset

Benchmark

There's a certain monk who I don't think would let me get away with writing this without some benchmarking information so here you go.

    # cmpthese() comes from the core Benchmark module.
    use Benchmark qw(cmpthese);

    # This is a convenient source of data to match against. (Plato's _The_Republic_)
    $data = q[I WENT down yesterday to the Piraeus with Glaucon the son of Ariston,
    that I might offer up my prayers to the goddess; and also because I wanted to see in what
    manner they would celebrate the festival, which was a new thing. I was delighted with the procession
    of the inhabitants; but that of the Thracians was equally, if not more,
    beautiful. When we had finished our prayers and viewed the
    spectacle, we turned in the direction of the city; and at that
    instant Polemarchus the son of Cephalus chanced to catch sight
    of us from a distance as we were starting on our way home, and
    told his servant to run and bid us wait for him. The servant
    took hold of me by the cloak behind, and said: Polemarchus
    desires you to wait.];

    # Both of these constructs produce the same structure. Any benchmarking
    # differences between these should be attributed to system noise. qr() merely
    # copies the already-compiled expression into the target. Perl has already
    # compiled the regular expression during the initial parsing so this qr[] can
    # be reused and is a very fast assignment.
    sub qr_0_a { $qr = qr[[spectal]{9}]; 1 }
    sub qr_0_b { $qr = qr[[spectal]{9}]o; 1 }
    cmpthese( 0, { qr_0_a => \&qr_0_a, qr_0_b => \&qr_0_b } );
    # Benchmark: running qr_0_a, qr_0_b, each for at least 3 CPU seconds...
    #     qr_0_a:  4 wallclock secs ( 3.11 usr +  0.00 sys =  3.11 CPU) @ 14585.21/s (n=45360)
    #     qr_0_b:  4 wallclock secs ( 3.10 usr +  0.00 sys =  3.10 CPU) @ 14631.94/s (n=45359)
    #           Rate qr_0_a qr_0_b
    # qr_0_a 14585/s     --    -0%
    # qr_0_b 14632/s     0%     --

    # This demonstrates that in both cases it is a simple assignment. *No*
    # compilation occurs and the /o flag is completely useless here.

    sub qr_1_a { $qr = qr[[spectal]{9}]; $data =~ $qr; 1 }
    sub qr_1_b { $qr = qr[[spectal]{9}]o; $data =~ $qr; 1 }
    cmpthese( 0, { qr_1_a => \&qr_1_a, qr_1_b => \&qr_1_b } );
    # Benchmark: running qr_1_a, qr_1_b, each for at least 3 CPU seconds...
    #     qr_1_a:  3 wallclock secs ( 3.20 usr +  0.00 sys =  3.20 CPU) @ 2692.50/s (n=8616)
    #     qr_1_b:  4 wallclock secs ( 3.21 usr +  0.00 sys =  3.21 CPU) @ 2684.11/s (n=8616)
    #          Rate qr_1_b qr_1_a
    # qr_1_b 2684/s     --    -0%
    # qr_1_a 2692/s     0%     --

    # Again, no difference that is not attributable to system noise.

    sub qr_1_c { $qr = qr[[spectal]{9}]; $data =~ m/$qr/o; 1 }
    sub qr_1_d { $qr = qr[[spectal]{9}]o; $data =~ m/$qr/o; 1 }
    cmpthese( 0, { qr_1_a => \&qr_1_a,
                   qr_1_b => \&qr_1_b,
                   qr_1_c => \&qr_1_c,
                   qr_1_d => \&qr_1_d } );
    # Benchmark: running qr_1_a, qr_1_b, qr_1_c, qr_1_d, each for at least 3 CPU seconds...
    #     qr_1_a:  3 wallclock secs ( 3.12 usr +  0.00 sys =  3.12 CPU) @ 2702.56/s (n=8432)
    #     qr_1_b:  2 wallclock secs ( 3.13 usr +  0.00 sys =  3.13 CPU) @ 2693.93/s (n=8432)
    #     qr_1_c:  5 wallclock secs ( 3.21 usr +  0.00 sys =  3.21 CPU) @ 2747.35/s (n=8819)
    #     qr_1_d:  2 wallclock secs ( 3.21 usr +  0.00 sys =  3.21 CPU) @ 2744.55/s (n=8810)
    #          Rate qr_1_b qr_1_a qr_1_d qr_1_c
    # qr_1_b 2694/s     --    -0%    -2%    -2%
    # qr_1_a 2703/s     0%     --    -2%    -2%
    # qr_1_d 2745/s     2%     2%     --    -0%
    # qr_1_c 2747/s     2%     2%     0%     --

    # This shows the very slight difference between c/d and a/b. In a/b
    # the regcomp on the match() is a very fast operation. In c/d every match()
    # except the first has no regcomp() operation.

    # Appending /o on a qr// expression has the same effect as on a m//
    # expression. In the following code the qr// expression only compiles once
    # and all repetitions after the first go-round merely copy the Regexp object.
    # So the effect is only relevant if you are interpolating something into your
    # regular expression. As always, compilation only occurs the first time and
    # then /o prevents the compilation from occurring again. Without /o the
    # expression would be properly recompiled each time 'round. The difference is
    # between having $qro reflect the currently interpolated $_ variable and
    # having it permanently fixed as '9' which is the first value to be used. I
    # don't know of any circumstance when this is the required behaviour so I'd
    # categorize any use of /o with qr// as a bug.

    # Count backwards from 9
    for (qw( 9 8 7 6 5 4 3 2 1 0 )) {
        $qro = qr[[spectal]{$_}]o;
    }
    # The first form is how the perl used for these benchmarks stringifies the
    # object; perl 5.14 and later stringify it as (?^:[spectal]{9}).
    if ( $qro ne "(?-xism:[spectal]{9})" and $qro ne "(?^:[spectal]{9})" ) { die; }
    # $qro is equal to (?-xism:[spectal]{9}) which demonstrates that the
    # compilation step was removed after its one-time-only execution.

    # All of these "variations" produce the same data and follow the same
    # execution path. Any speed differences should be attributed to system noise
    # as in a very real sense - they are 100% identical. Since $qr and $qro
    # contain precompiled regular expressions it is *nearly* as efficient as a
    # plain match like $data =~ /[spectal]{9}/. The overhead is the regcreset()
    # and regcomp() operations. Since $qr and $qro already contain compiled
    # regular expressions regcomp() skips all the compilation and returns very
    # quickly. This is a low-overhead operation.
    $qr = qr[[spectal]{9}];
    $qro = qr[[spectal]{9}]o;
    sub qr_2_a { $data =~ $qr }
    sub qr_2_b { $data =~ $qro }
    sub qr_2_c { $data =~ /$qr/ }
    sub qr_2_d { $data =~ /$qro/ }
    cmpthese( 0, { qr_2_a => \&qr_2_a,
                   qr_2_b => \&qr_2_b,
                   qr_2_c => \&qr_2_c,
                   qr_2_d => \&qr_2_d } );
    # Benchmark: running qr_2_a, qr_2_b, qr_2_c, qr_2_d, each for at least 3 CPU seconds...
    #     qr_2_a:  4 wallclock secs ( 3.17 usr +  0.00 sys =  3.17 CPU) @ 3850.47/s (n=12206)
    #     qr_2_b:  2 wallclock secs ( 3.07 usr +  0.00 sys =  3.07 CPU) @ 3812.70/s (n=11705)
    #     qr_2_c:  3 wallclock secs ( 3.15 usr +  0.00 sys =  3.15 CPU) @ 3836.51/s (n=12085)
    #     qr_2_d:  3 wallclock secs ( 3.17 usr +  0.00 sys =  3.17 CPU) @ 3812.30/s (n=12085)
    #          Rate qr_2_d qr_2_b qr_2_c qr_2_a
    # qr_2_d 3812/s     --    -0%    -1%    -1%
    # qr_2_b 3813/s     0%     --    -1%    -1%
    # qr_2_c 3837/s     1%     1%     --    -0%
    # qr_2_a 3850/s     1%     1%     0%     --

    # A small difference but it's just system noise. There is no actual
    # difference here.

    # The same comparison again, this time adding /o to the match itself (e/f).
    sub qr_2_e { $data =~ /$qr/o }
    sub qr_2_f { $data =~ /$qro/o }
    cmpthese( 0, { qr_2_a => \&qr_2_a,
                   qr_2_b => \&qr_2_b,
                   qr_2_c => \&qr_2_c,
                   qr_2_d => \&qr_2_d,
                   qr_2_e => \&qr_2_e,
                   qr_2_f => \&qr_2_f } );
    # Benchmark: running qr_2_a, qr_2_b, qr_2_c, qr_2_d, qr_2_e, qr_2_f, each for at least 3 CPU seconds...
    #     qr_2_a:  4 wallclock secs ( 3.26 usr +  0.00 sys =  3.26 CPU) @ 3864.72/s (n=12599)
    #     qr_2_b:  5 wallclock secs ( 3.05 usr +  0.00 sys =  3.05 CPU) @ 3880.33/s (n=11835)
    #     qr_2_c:  3 wallclock secs ( 3.11 usr +  0.00 sys =  3.11 CPU) @ 3858.52/s (n=12000)
    #     qr_2_d:  3 wallclock secs ( 3.11 usr +  0.00 sys =  3.11 CPU) @ 3858.52/s (n=12000)
    #     qr_2_e:  4 wallclock secs ( 3.09 usr +  0.00 sys =  3.09 CPU) @ 3950.16/s (n=12206)
    #     qr_2_f:  5 wallclock secs ( 3.09 usr +  0.00 sys =  3.09 CPU) @ 3911.00/s (n=12085)
    #          Rate qr_2_d qr_2_c qr_2_a qr_2_b qr_2_f qr_2_e
    # qr_2_d 3859/s     --    -0%    -0%    -1%    -1%    -2%
    # qr_2_c 3859/s     0%     --    -0%    -1%    -1%    -2%
    # qr_2_a 3865/s     0%     0%     --    -0%    -1%    -2%
    # qr_2_b 3880/s     1%     1%     0%     --    -1%    -2%
    # qr_2_f 3911/s     1%     1%     1%     1%     --    -1%
    # qr_2_e 3950/s     2%     2%     2%     2%     1%     --

    # e/f are in reality ever so slightly faster than a/b/c/d which are all
    # equivalent. This just shows that the speed gain is small enough to
    # disappear right into system noise.

    # If the /o flag is added onto /$qr/ then that expression becomes forever
    # bound to whatever value $qr contained at the time it was first executed.
    # It turns out that /o works by having the regcomp() operation remove itself
    # from the executing program.

In reply to /o is dead, long live qr//! by diotalevi
