/o is dead, long live qr//!

by diotalevi (Canon)
on Jun 25, 2003 at 22:05 UTC

The recent discussions about whether or not to use /o and when to use qr// have struck a nerve with me. In fact, I've become so irked with all the poorly justified conclusions that I examined the source in question, traced it with gdb and found out exactly what happens. After reading this document you should understand what happens when you use interpolation, qr//, literals and some other odd things as regular expressions. The purpose of this is to put to rest all the silly "debates" about whether one use of qr// is faster than /o, so that we can get on to other, sillier and more important things.

  1. Regexp Matching
  2. Regexp guts
  3. Examples
  4. Benchmark

Regexp matching

I'm told that /o appeared in perl 4 as a solution so that regular expressions would not need to be repeatedly recompiled when they used variable interpolation. With the advent of perl 5 the qr// construct wholly replaces the /o flag. In conversation with BrowserUK it occurred to me that /o can be considered a sort of "delayed constant expression". So instead of just writing / ... / in your source you could write $some_regex = ' ... '; /$some_regex/o and whatever $some_regex was equal to at first use would become the constant expression. This gives you the capability to define your expression elsewhere and re-use it in a matching operation. The key idea here is that because it is a constant, once defined it can never be changed. Ever. (That is, until perl exits and you start a new script.)

    $BIG_REGEX = " ... ";
    sub do_something {
        # The expression is "fixed" into place on first use.
        # !! It's a constant.
        if ( $sexpert =~ m/$BIG_REGEX/o ) { ... }
    }
    do_something(); # the expression is compiled and fixed into place
    do_something(); # the compiled expression is directly re-used
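A small runnable sketch of that "fixed into place" behaviour may help. The names $pattern and matches_frozen are made up for illustration; the only thing being shown is that a later change to the variable is not seen by a match that uses /o.

```perl
use strict;
use warnings;

my $pattern = 'foo';    # hypothetical pattern source, not from the post

# The /o flag compiles the interpolated pattern the first time this
# match executes and never looks at $pattern again.
sub matches_frozen {
    my ($text) = @_;
    return $text =~ /$pattern/o ? 1 : 0;
}

my @results;
push @results, matches_frozen('foo');   # first call: compiles /foo/
$pattern = 'bar';                       # this change is never seen by the match
push @results, matches_frozen('bar');   # still tested against /foo/
push @results, matches_frozen('foo');

print "@results\n";   # prints "1 0 1"
```

The middle result is the surprising one: the string 'bar' is tested against the frozen /foo/, not against the current value of $pattern.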

It's worth noting that in perl 5 the expression engine already optimizes by checking whether the regex has changed from one invocation to the next. So if I wrote $some_regex = ' ... '; do_something while /$some_regex/; the expression would only be compiled once. The engine compares the new string against the previously compiled regex using the equivalent of the `eq' operator. So even without /o or qr//, perl still avoids recompiling unless it is necessary. You do pay the overhead of the string-equality test. It's a small but non-zero price.

    $HUGE_REGEX = " ... ";
    sub do_it {
        # The expression is recompiled whenever $HUGE_REGEX changes
        if ( $sexpert =~ m/$HUGE_REGEX/ ) { ... }
    }
    do_it(); # the expression is compiled
    do_it(); # the string didn't change so the compiled regex is re-used
    $HUGE_REGEX = " !!! ";
    do_it(); # the string changed so the regex is recompiled.
    do_it(); # the new regex is re-used.
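For contrast, here is a minimal sketch of the same kind of match without /o. The names $pattern and matches_current are illustrative only. It shows the visible effect (the match follows the variable); the reuse of the compiled form happens inside perl and is not directly observable from this code.

```perl
use strict;
use warnings;

my $pattern = 'foo';    # hypothetical pattern source

# No /o: every call compares the current string with the one compiled
# last time and recompiles only if they differ.
sub matches_current {
    my ($text) = @_;
    return $text =~ /$pattern/ ? 1 : 0;
}

my @results;
push @results, matches_current('foo');  # compiles /foo/
push @results, matches_current('foo');  # same string, nothing to recompile
$pattern = 'bar';
push @results, matches_current('foo');  # string changed: now tested against /bar/
push @results, matches_current('bar');

print "@results\n";   # prints "1 1 0 1"
```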

The qr// operator (documented in perlop) allows $HUGE_REGEX to be precompiled at script compilation time (instead of at runtime - this might be important for mod_perl users) and potentially gives greater control over expression changes. By moving to the qr// operator I've just allowed the expression to be changed, since now the code will match whatever the current value is. In the earlier /o example, if I changed $BIG_REGEX after using it once, the change would not be reflected in the match being tested. I declared the expression using my() so that it couldn't be altered by other code. The replacement code with qr// is alteration-friendly. In general I think that the qr// method is the Right Way to solve any problem that might also be handled by /o. It's just more flexible and the intent is easier to deal with. The alternative is code that is easier to break.

    # I need a better way to phrase this. Ideas?
    $BIGGER_REGEX = qr/ ... /; # The contents of qr// are a constant expression so are compiled at compile-time.
    sub do_it_again {
        if ( $sexpert =~ $BIGGER_REGEX ) { ... }
    }
    do_it_again(); # the pre-compiled regex is used directly.
    do_it_again(); # ditto

    # This qr// was compiled at compile-time. Store a copy in $BIGGER_REGEX.
    # No recompilation occurs now.
    $BIGGER_REGEX = qr/ !!! /;
    do_it_again(); # Like before, a pre-compiled regex is used directly. Nothing special happens.
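As a rough illustration of why the qr// form is alteration-friendly, the sketch below swaps one compiled object for another. The names $regex and matches_object are invented for the example.

```perl
use strict;
use warnings;

my $regex = qr/foo/;    # a compiled pattern held in an ordinary scalar

sub matches_object {
    my ($text) = @_;
    return $text =~ $regex ? 1 : 0;
}

my $class = ref $regex;                 # qr// returns a 'Regexp' object

my @results;
push @results, matches_object('foo');   # uses the compiled /foo/
$regex = qr/bar/i;                      # swap in a different compiled object
push @results, matches_object('BAR');   # the match follows the new object, flags included
push @results, matches_object('foo');

print "$class: @results\n";   # prints "Regexp: 1 1 0"
```

Note that the /i flag travels with the object, so the match site does not need to know about it.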

It should be immediately apparent that using /o on a regex with *no* interpolation has zero effect. You could use it on a regex that uses interpolation but then you get delayed runtime effects. In general I'd much prefer all of my compilation to occur at compilation time. As a matter of practice string evals are restricted; this works similarly. If you use qr// expressions with literal contents then you've shunted the regex compilation and syntax checking off to the normal script compilation time. The alternative is that you discover a syntax error at runtime. Yuck.
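The difference between the two failure modes can be sketched as follows. This is an illustration under assumptions: $broken and never_called are made-up names, and a string eval stands in for "compiling a script" so that the example can report the compile-time error instead of aborting.

```perl
use strict;
use warnings;

my $broken = '(';   # hypothetical bad pattern: an unbalanced parenthesis

# Interpolated pattern: this file compiles cleanly and the error only
# surfaces when the match is actually executed.
my $runtime_ok    = eval { my $hit = 'x' =~ /$broken/; 1 };
my $runtime_error = $@;

# Literal pattern: the same mistake is reported while the code is being
# compiled, even though never_called() is never invoked.
my $compile_ok    = eval q{ sub never_called { return 'x' =~ /(/ } 1 };
my $compile_error = $@;

print "interpolated: ", ($runtime_ok ? "ok\n" : "failed when run: $runtime_error");
print "literal:      ", ($compile_ok ? "ok\n" : "failed when compiled: $compile_error");
```

Both branches report an "Unmatched (" error; the point is only when it is raised.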

Regexp guts

And now for some of the inside dirt. When you use regular expressions you are typically using one or more of the internal op codes "match"/"subst", "regcomp", "regcreset" and "qr". match and subst are the bare-bones functions underlying m// and s///. The other op codes are in support of them. regcomp is what is responsible for compiling the regex as needed. regcreset indicates whether the expression was interpolated and qr returns a 'Regexp' blessed scalar with a compiled regex in its guts. The key here is to see what triggers when.

match / subst: Both of these opcodes expect to have been given a compiled regular expression. They execute it and do all the actual work of a =~ s/// or =~ m// match. The operation here is best described by referring back to the manual in perlre and perlop. I won't detail what the regexp engine actually does because I don't know that myself!

regcomp: This is the key operation for most of the controversy here. When given a compiled expression from a qr// object it returns that directly. When given a string and a previously compiled expression, it checks whether they are the same. If they are string-equal then it returns the compiled expression. If they are not equal then it invokes the regular expression compiler. This can happen at run-time, somewhat like string-eval. All the usual caveats on syntax errors apply here - you won't find out about them until the expression is actually run. Invoking the compiler is an expensive operation and shouldn't be done in time-critical sections of code.

Adding the /o flag does a special thing. Once regcomp gets a suitable compiled expression it wires that expression directly into the match / subst operation and then removes itself. So in essence, the perl program re-writes itself on the fly to remove the compilation stage.

regcreset: This is a dirt cheap operation. It resets the regular expression's notion of whether there was any interpolation at all. This is only used to prevent malicious use of (?{ }) in interpolated expressions. See use re 'eval' for more detail on this.

qr: This takes an already compiled regular expression, constructs a new blessed object and attaches the expression to the object with some magic. This is somewhat equivalent to saying bless \..., 'Regexp'. You'll note that if you didn't use interpolation inside the qr// expression, all the actual regex compilation already happened at script-compilation time. In fact, the rules for when and how a qr// expression is compiled are exactly the same as for a m// operation. The idea here is that if you give a qr// compiled expression to a m// or s/// operation then no runtime compilation is required at all.
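If you'd rather look at these ops yourself than take my word for it, the core B::Concise module can print the op sequence of a code ref. The following is only a sketch: op_listing is a made-up helper, the exact listing differs between perl versions, and so the code just checks whether a regcomp op shows up at all.

```perl
use strict;
use warnings;
use B::Concise ();

our ($data, $re) = ('spectacles', '[spectal]{9}');

# Render the ops of a code ref in execution order and return the text.
sub op_listing {
    my ($code) = @_;
    B::Concise::walk_output(\my $listing);      # capture output in a string
    B::Concise::compile('-exec', $code)->();    # walk the ops of $code
    return $listing;
}

my $literal      = op_listing(sub { $data =~ m/[spectal]{9}/ });
my $interpolated = op_listing(sub { $data =~ m/$re/ });

printf "literal match uses regcomp:      %s\n", $literal      =~ /\bregcomp\b/ ? 'yes' : 'no';
printf "interpolated match uses regcomp: %s\n", $interpolated =~ /\bregcomp\b/ ? 'yes' : 'no';
```

From the command line, perl -MO=Concise,-exec -e '...' gives the same kind of listing for a one-liner.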

Examples

Constant Expression 1

This shows a normal constant expression match. The only operation of interest is match.

    $data =~ m/[spectal]{9}/; # Match against $data
    # match(/"[spectal]{9}"/)

Simple Interpolation 1

This shows a plain string and then what happens when the string is used to match against $data. No special attempt was made at precompilation or anything. Note that though I wrote the match differently for each case, the same program is generated. regcomp is going to be fully slow because it has to compile the given expression. Once compiled it stores a copy with that match operation. The unfortunate part is that these are two separate match operations, so the second cannot reuse the compilation of the first.

    $re = "[spectal]{9}";

    $data =~ $re;
    # match()
    # regcomp() # Compile the expression
    # regcreset

    $data =~ m/$re/;
    # match()
    # regcomp() # Compile the expression
    # regcreset

Constant expression 2

The expression inside qr// was compiled during BEGIN and is merely used in the match. The key difference between this and the previous examples is that regcomp will immediately exit because it was given a qr// precompiled object. So regcomp will be a super fast operation in this case.

    $qr = qr/[spectal]{9}/; # This was compiled during BEGIN{} and has no runtime effect.

    $data =~ $qr;
    # match()
    # regcomp() # Re-use the expression
    # regcreset

    $data =~ m/$qr/;
    # match()
    # regcomp() # Re-use the expression
    # regcreset

Interpolation 2

This shows how qr// can compile an expression at runtime. The later uses of the precompiled qr// object conform to the same rules as described in Constant expression 2.

    $re = '[spectal]{9}';

    $qr = qr/$re/; # Compile the expression
    # qr()
    # regcomp() # Compile the expression
    # regcreset

    $data =~ $qr;
    # match()
    # regcomp() # Re-use the expression
    # regcreset

    $data =~ m/$qr/;
    # match()
    # regcomp() # Re-use the expression
    # regcreset

Concatenated Interpolation

In this case I show that precompiling only some portions of an expression does not help: those fragments are not reused, and for recompilation purposes the entire expression is examined. So either the entire thing is precompiled or none of it is. The first example has only two fragments, both of which are precompiled, but this is no different from having no precompilation at all.

    $qr_a = qr/\w/; # Pre-compiled during BEGIN {}
    $qr_b = qr/\d/; # Pre-compiled during BEGIN {}

    # Stringify both $qr_a and $qr_b then compile a new regex. The previous
    # pre-compilation is not used
    $data =~ m/$qr_a$qr_b/;
    # match()
    # regcomp() # Compile the expression
    # regcreset

    $re_a = "\\w";
    $re_b = "\\d";
    $data =~ m/$re_a$re_b/;
    # match()
    # regcomp() # Compile the expression
    # regcreset
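To make the stringify-then-recompile step concrete, here is a small sketch using the same two fragments. The variable names $combined_source and $combined are made up; the exact stringified form of a qr// object depends on the perl version, so the code prints it rather than assuming it.

```perl
use strict;
use warnings;

my $qr_a = qr/\w/;
my $qr_b = qr/\d/;

# Interpolating two objects side by side yields one new string, and
# that string is what gets handed to the regex compiler.
my $combined_source = "$qr_a$qr_b";
my $combined        = qr/$qr_a$qr_b/;

print "source handed to the compiler: $combined_source\n";
print "resulting object:              $combined\n";
print "matches 'a1': ", ('a1' =~ $combined ? 'yes' : 'no'), "\n";
print "matches 'ab': ", ('ab' =~ $combined ? 'yes' : 'no'), "\n";
```

The resulting object wraps the text of both fragments in a new group, which is a hint that it was compiled afresh from the combined string.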

Postponed Constant Expressions 1

This example is like Simple Interpolation 1 except that with the /o flag the regcomp() op is removed from the program after its first execution. I think regcreset remains.

    $re = '[spectal]{9}';

    $data =~ m/$re/o;
    # match()
    # regcomp() # Compile the expression and remove this step
    # regcreset

Simple interpolation -> Literal 2

This example mirrors Interpolation 2 except that the regcomp operation is removed from the qr// line.

    $re = '[spectal]{9}';

    $qr = qr/$re/o;
    # qr()
    # regcomp() # Compile the expression and remove this step
    # regcreset

    $data =~ m/$qr/;
    # match()
    # regcomp() # Re-use the expression
    # regcreset

Simple interpolation -> Literal 3

Picking up where the preceding example left off, this goes one step further and removes the already really fast regcomp step from the match operation.

    $qr = ...

    $data =~ m/$qr/o;
    # match()
    # regcomp() # Associate the precompiled expression and remove this step
    # regcreset

Benchmark

There's a certain monk who I don't think would let me get away with writing this without some benchmarking information so here you go.

    use Benchmark qw( cmpthese );

    # This is a convenient source of data to match against. (Plato's _The Republic_)
    $data = q[I WENT down yesterday to the Piraeus with Glaucon the son of
    Ariston, that I might offer up my prayers to the goddess; and also because
    I wanted to see in what manner they would celebrate the festival,
    which was a new thing. I was delighted with the procession of the
    inhabitants; but that of the Thracians was equally, if not more,
    beautiful. When we had finished our prayers and viewed the spectacle,
    we turned in the direction of the city; and at that instant
    Polemarchus the son of Cephalus chanced to catch sight of us from a
    distance as we were starting on our way home, and told his servant to
    run and bid us wait for him. The servant took hold of me by the cloak
    behind, and said: Polemarchus desires you to wait.];

    # Both of these constructs produce the same structure. Any benchmarking
    # differences between these should be attributed to system noise. qr() merely
    # copies the already-compiled expression into the target. Perl has already compiled
    # the regular expression during the initial parsing so this qr[] can be reused
    # and is a very fast assignment.
    sub qr_0_a { $qr = qr[[spectal]{9}]; 1 }
    sub qr_0_b { $qr = qr[[spectal]{9}]o; 1 }
    cmpthese( 0, { qr_0_a => \&qr_0_a,
                   qr_0_b => \&qr_0_b } );
    # Benchmark: running qr_0_a, qr_0_b, each for at least 3 CPU seconds...
    #     qr_0_a:  4 wallclock secs ( 3.11 usr +  0.00 sys =  3.11 CPU) @ 14585.21/s (n=45360)
    #     qr_0_b:  4 wallclock secs ( 3.10 usr +  0.00 sys =  3.10 CPU) @ 14631.94/s (n=45359)
    #           Rate qr_0_a qr_0_b
    # qr_0_a 14585/s     --    -0%
    # qr_0_b 14632/s     0%     --

    # This demonstrates that in both cases it is a simple assignment. *No* compilation
    # occurs and the /o flag is completely useless here.
    sub qr_1_a { $qr = qr[[spectal]{9}]; $data =~ $qr; 1 }
    sub qr_1_b { $qr = qr[[spectal]{9}]o; $data =~ $qr; 1 }
    cmpthese( 0, { qr_1_a => \&qr_1_a,
                   qr_1_b => \&qr_1_b } );
    # Benchmark: running qr_1_a, qr_1_b, each for at least 3 CPU seconds...
    #     qr_1_a:  3 wallclock secs ( 3.20 usr +  0.00 sys =  3.20 CPU) @ 2692.50/s (n=8616)
    #     qr_1_b:  4 wallclock secs ( 3.21 usr +  0.00 sys =  3.21 CPU) @ 2684.11/s (n=8616)
    #          Rate qr_1_b qr_1_a
    # qr_1_b 2684/s     --    -0%
    # qr_1_a 2692/s     0%     --

    # Again, no difference that is not attributable to system noise.
    sub qr_1_c { $qr = qr[[spectal]{9}]; $data =~ m/$qr/o; 1 }
    sub qr_1_d { $qr = qr[[spectal]{9}]o; $data =~ m/$qr/o; 1 }
    cmpthese( 0, { qr_1_a => \&qr_1_a,
                   qr_1_b => \&qr_1_b,
                   qr_1_c => \&qr_1_c,
                   qr_1_d => \&qr_1_d } );
    # Benchmark: running qr_1_a, qr_1_b, qr_1_c, qr_1_d, each for at least 3 CPU seconds...
    #     qr_1_a:  3 wallclock secs ( 3.12 usr +  0.00 sys =  3.12 CPU) @ 2702.56/s (n=8432)
    #     qr_1_b:  2 wallclock secs ( 3.13 usr +  0.00 sys =  3.13 CPU) @ 2693.93/s (n=8432)
    #     qr_1_c:  5 wallclock secs ( 3.21 usr +  0.00 sys =  3.21 CPU) @ 2747.35/s (n=8819)
    #     qr_1_d:  2 wallclock secs ( 3.21 usr +  0.00 sys =  3.21 CPU) @ 2744.55/s (n=8810)
    #          Rate qr_1_b qr_1_a qr_1_d qr_1_c
    # qr_1_b 2694/s     --    -0%    -2%    -2%
    # qr_1_a 2703/s     0%     --    -2%    -2%
    # qr_1_d 2745/s     2%     2%     --    -0%
    # qr_1_c 2747/s     2%     2%     0%     --

    # This shows the very slight difference between c/d and a/b. In a/b
    # the regcomp on the match() is a very fast operation. In c/d every match()
    # except the first has no regcomp() operation.

    # Appending /o on a qr// expression has the same effect as on a m//
    # expression. In the following code the qr// expression only compiles once
    # and all repetitions after the first go-round merely copy the Regexp object.
    # So the effect is only relevant if you are interpolating something into your
    # regular expression. As always, compilation only occurs the first time and then
    # /o prevents the compilation from occurring again. Without /o the expression would
    # be properly recompiled each time 'round. The difference is between having $qro
    # reflect the currently interpolated $_ variable and having it permanently fixed as
    # '9' which is the first value to be used. I don't know of any circumstance when this
    # is the required behaviour so I'd categorize any use of /o with qr// as a bug.

    # Count backwards from 9
    for (qw( 9 8 7 6 5 4 3 2 1 0 )) {
        $qro = qr[[spectal]{$_}]o;
    }
    if ($qro ne "(?-xism:[spectal]{9})") { die; }
    # $qro is equal to (?-xism:[spectal]{9}) which demonstrates that the
    # compilation step was removed after its one-time-only execution.

    # All three of these "variations" produce the same data and follow the same
    # execution path. Any speed differences should be attributed to system noise
    # as in a very real sense - they are 100% identical. Since $qr and $qro contain
    # precompiled regular expressions it is *nearly* as efficient as a plain match like
    # $data =~ /[spectal]{9}/. The overhead is the regcreset() and regcomp() operations.
    # Since $qr and $qro already contain compiled regular expressions regcomp() skips
    # all the compilation and returns very quickly. This is a low-overhead operation.
    $qr = qr[[spectal]{9}];
    $qro = qr[[spectal]{9}]o;
    sub qr_2_a { $data =~ $qr }
    sub qr_2_b { $data =~ $qro }
    sub qr_2_c { $data =~ /$qr/ }
    sub qr_2_d { $data =~ /$qro/ }
    cmpthese( 0, { qr_2_a => \&qr_2_a,
                   qr_2_b => \&qr_2_b,
                   qr_2_c => \&qr_2_c,
                   qr_2_d => \&qr_2_d } );
    # Benchmark: running qr_2_a, qr_2_b, qr_2_c, qr_2_d, each for at least 3 CPU seconds...
    #     qr_2_a:  4 wallclock secs ( 3.17 usr +  0.00 sys =  3.17 CPU) @ 3850.47/s (n=12206)
    #     qr_2_b:  2 wallclock secs ( 3.07 usr +  0.00 sys =  3.07 CPU) @ 3812.70/s (n=11705)
    #     qr_2_c:  3 wallclock secs ( 3.15 usr +  0.00 sys =  3.15 CPU) @ 3836.51/s (n=12085)
    #     qr_2_d:  3 wallclock secs ( 3.17 usr +  0.00 sys =  3.17 CPU) @ 3812.30/s (n=12085)
    #          Rate qr_2_d qr_2_b qr_2_c qr_2_a
    # qr_2_d 3812/s     --    -0%    -1%    -1%
    # qr_2_b 3813/s     0%     --    -1%    -1%
    # qr_2_c 3837/s     1%     1%     --    -0%
    # qr_2_a 3850/s     1%     1%     0%     --

    # A small difference but it's just system noise. There is no actual difference here.

    # All of these "variations" produce the same data and follow the same
    # execution path. Any speed differences should be attributed to system noise
    # as in a very real sense - they are 100% identical. Since $qr and $qro contain
    # precompiled regular expressions it is *nearly* as efficient as a plain match like
    # $data =~ /[spectal]{9}/. The overhead is the regcreset() and regcomp() operations.
    # Since $qr and $qro already contain compiled regular expressions regcomp() skips
    # all the compilation and returns very quickly. This is a low-overhead operation.
    sub qr_2_e { $data =~ /$qr/o }
    sub qr_2_f { $data =~ /$qro/o }
    cmpthese( 0, { qr_2_a => \&qr_2_a,
                   qr_2_b => \&qr_2_b,
                   qr_2_c => \&qr_2_c,
                   qr_2_d => \&qr_2_d,
                   qr_2_e => \&qr_2_e,
                   qr_2_f => \&qr_2_f } );
    # Benchmark: running qr_2_a, qr_2_b, qr_2_c, qr_2_d, qr_2_e, qr_2_f, each for at least 3 CPU seconds...
    #     qr_2_a:  4 wallclock secs ( 3.26 usr +  0.00 sys =  3.26 CPU) @ 3864.72/s (n=12599)
    #     qr_2_b:  5 wallclock secs ( 3.05 usr +  0.00 sys =  3.05 CPU) @ 3880.33/s (n=11835)
    #     qr_2_c:  3 wallclock secs ( 3.11 usr +  0.00 sys =  3.11 CPU) @ 3858.52/s (n=12000)
    #     qr_2_d:  3 wallclock secs ( 3.11 usr +  0.00 sys =  3.11 CPU) @ 3858.52/s (n=12000)
    #     qr_2_e:  4 wallclock secs ( 3.09 usr +  0.00 sys =  3.09 CPU) @ 3950.16/s (n=12206)
    #     qr_2_f:  5 wallclock secs ( 3.09 usr +  0.00 sys =  3.09 CPU) @ 3911.00/s (n=12085)
    #          Rate qr_2_d qr_2_c qr_2_a qr_2_b qr_2_f qr_2_e
    # qr_2_d 3859/s     --    -0%    -0%    -1%    -1%    -2%
    # qr_2_c 3859/s     0%     --    -0%    -1%    -1%    -2%
    # qr_2_a 3865/s     0%     0%     --    -0%    -1%    -2%
    # qr_2_b 3880/s     1%     1%     0%     --    -1%    -2%
    # qr_2_f 3911/s     1%     1%     1%     1%     --    -1%
    # qr_2_e 3950/s     2%     2%     2%     2%     1%     --

    # e/f are in reality ever so slightly faster than a/b/c/d which are all equivalent.
    # This just shows that the speed gain is small enough to disappear right into
    # system noise.

    # If the /o flag is added onto /$qr/ then that expression becomes forever bound to
    # whatever value $qr contained at the time it was first executed. It turns out that /o
    # works by having the regcomp() operation remove itself from the executing program.

Replies are listed 'Best First'.
Re: /o is dead, long live qr//!
by japhy (Canon) on Jun 25, 2003 at 23:43 UTC
    There is not a noticeable increase in speed in using $x = qr/.../; /$x/ instead of $x = '...'; /$x/o; in the testing I have done. And the /o modifier is ONLY useful if the regex has any variables in it.

    The place where qr// is important is when looping over strings and patterns. If you do a benchmark of code like this:

        while (<>) {
            for my $p (qr/a+b/, qr/c+d/, qr/e+f/) { foo() if /$p/ }
        }

        # vs.

        while (<>) {
            for my $s ('a+b', 'c+d', 'e+f') { foo() if /$s/ }
        }

    The qr// version will be considerably faster because Perl knows the regexes are already compiled, whereas in the string version, we need to compile three regexes for every line of input.

    _____________________________________________________
    Jeff[japhy]Pinyan: Perl, regex, and perl hacker, who'd like a job (NYC-area)
    s++=END;++y(;-P)}y js++=;shajsj<++y(p-q)}?print:??;

      Actually, I tend to shy away from using qr, because they are dead slow when interpolating into larger constructs. Building large regexes from smaller parts is something I do often. Here's a benchmark showing a dramatic difference:
          #!/usr/bin/perl
          use strict;
          use warnings;
          use Benchmark qw /cmpthese/;

          cmpthese -2 => {
              qq => 'my $x = qq "[a]"; $x = qq "[a]$x" for 1 .. 100; "" =~ /$x/',
              qr => 'my $x = qr "[a]"; $x = qr "[a]$x" for 1 .. 100; "" =~ /$x/',
          }
          __END__
          Benchmark: running qq, qr for at least 2 CPU seconds...
                  qq:  6 wallclock secs ( 2.11 usr +  0.03 sys =  2.14 CPU) @ 5902.80/s (n=12632)
                  qr:  4 wallclock secs ( 2.03 usr +  0.00 sys =  2.03 CPU) @ 39.90/s (n=81)
                 Rate     qr     qq
          qr   39.9/s     --   -99%
          qq   5903/s 14693%     --

      Abigail

        To be fair, this is a problem with the *current* implementation of qr, and not the concept. Your two regexps are actually:

            # UPDATE: While the previous verbose format was more
            # impressive, it was hideously long. See the
            # readmore for the literal strings.
            $qq = "[a]"x100;
            $qr = "(?-xism:[a]"x100 . ")"x100;

        I think even a newbie could see that the latter is outrageously complicated...
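A scaled-down sketch of the same build-up (three rounds instead of a hundred) makes the nesting visible. The loop count is an arbitrary choice for readability, and the wrapper text around each qr// level depends on the perl version, so the strings are printed rather than assumed.

```perl
use strict;
use warnings;

my $qq = qq{[a]};
my $qr = qr{[a]};

# Three rounds are enough to see the shape of the result.
for (1 .. 3) {
    $qq = qq{[a]$qq};   # plain string: just gets longer
    $qr = qr{[a]$qr};   # qr object: each round wraps the previous one in a new group
}

print "qq: $qq\n";      # prints "qq: [a][a][a][a]"
print "qr: $qr\n";
printf "lengths: qq=%d qr=%d\n", length $qq, length "$qr";
```

Both patterns still match the same text; only the amount of structure the compiler has to chew through differs.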

      That's exactly what I said. The difference between $x as an object and $x as a string is a quick shortcut and an eq test.

Re: /o is dead, long live qr//!
by belg4mit (Prior) on Jun 26, 2003 at 01:08 UTC
Re: /o is dead, long live qr//!
by Anonymous Monk on Jun 26, 2003 at 12:01 UTC
    What about this:
        use constant foo => 'bar';
        my $x = qr/ ${\&foo} /xo;
        "something" =~ $x;

    I found this to be the only way I could achieve this effect. Is there a better way?

      Yes and? Your qr// expression interpolated the constant and then fixed it into place. Normally I'd just eschew that as a particularly ugly form of a regex though. In fact, I'd likely have written it like this instead. I'd be using the constant as it's intended (as in, not like a cleverly named function) and I still get something reasonable. Now other people like Perrin have been convincing me that constant isn't all that great anyway, especially given the bareword quoting rules with the => fat comma and interpolation (like you noticed).

          use constant FOO => 'bar';
          my $x = FOO;
          $x = qr/$x/;
          "something" =~ $x;

      So actually, I wouldn't have written it at all like that. This is more likely. Though I wouldn't have gone out of my way to create a qr// object if I was only going to use it in one place anyway. That looks like something that'd be better written as merely "something" =~ $FOO.

          our $FOO = 'bar';
          my $x = qr/$FOO/;
          "something" =~ $x;
