http://qs321.pair.com?node_id=218046

mce has asked for the wisdom of the Perl Monks concerning the following question:

Hi,

Can anyone enlighten me about the usage of the /o option in regular expressions? I have read the documentation (Camel book), but it puzzles me even more. For example:

    my @y = (1..9);
    my $x = 1;
    foreach my $value ( @y ) {
        if ( $value =~ /$x/ ) {
            print "without o\n";
        }
        if ( $value =~ /$x/o ) {
            print "with o\n";
        }
        ++$x;
    }

will print "with o" only once, but "without o" 9 times.

So, does this mean that using /o is a quick way of precompiling the regex, like you would with qr//?

But what is the difference then?
Does Perl keep the pattern compiled in memory with /o just as it does with qr//? With qr//, though, you can access it via a variable and make it lexical. Is that the only difference?

Is this all correct?


---------------------------
Dr. Mark Ceulemans
Senior Consultant
IT Masters, Belgium

Re: meaning of /o in regexes
by BrowserUk (Patriarch) on Dec 06, 2002 at 12:48 UTC

    My understanding, which is confirmed by your tests, is that when you use the /o modifier, any variables within the regex will only be interpolated the first time the regex is seen.

    This appears to be similar in effect to using the qr// op to create your regexes in advance. However, using qr// has the advantage that you can pre-compile your regexes in sections and then combine them in the m// and s/// operators in different combinations.

    A few things I haven't seen an explanation for (they may exist, I just haven't seen them):

    1. Why does the qr// operator accept the /o modifier?
    2. If you combine one or more parts defined with qr// in a regex with some non-precompiled stuff, do you still get the advantage of precompilation?

      Eg.

          my $re_int = qr/[+-]?\d+/;
          my $re_exp = qr/[Ee][+-]?$re_int/;
          if ( $str =~ m/^(?:$re_int\.)?$re_int$re_exp?$/ ) {
              print "I think I got a valid int or float\n";
          }

    3. If there was a non-compiled var reference in the above m//, do I still get any benefit from pre-compiling the other parts?
    4. What happens if I add a /o modifier to the m// above?
    5. If one or more of the pre-compiled parts (and/or the non-precompiled parts) contains capture brackets, was there any benefit (in performance terms) from pre-compiling some parts?

    I did once attempt to systematically benchmark these to try and determine which options and combinations of options had the greatest benefit from the performance standpoint, but the process is fraught with gotchas.


    Okay you lot, get your wings on the left, halos on the right. It's one size fits all, and "No!", you can't have a different color.
    Pick up your cloud down the end and "Yes" if you get allocated a grey one they are a bit damp under foot, but someone has to get them.
    Get used to the wings fast cos its an 8 hour day...unless the Govenor calls for a cyclone or hurricane, in which case 16 hour shifts are mandatory.
    Just be grateful that you arrived just as the tornado season finished. Them buggers are real work.

      Just a note: none of this applies if you are using qr the way it's meant - as the entire regex for m// or s/// as in $qr = qr/./; $_ = 'abc'; m/$qr/; s/$qr//; $_ =~ $qr. All of these more normal uses of expressions benefit from the precompilation. This note is about interpolating qr objects into other regular expressions which is different.


      Starting from the top: I created the short sample program and then dumped its opcode tree to see what it actually does. From this I can say that interpolating qr objects into another regular expression saves nothing. The objects are all concatenated (meaning stringification) and then compiled for the regex. If you add the /o modifier to any m// or s/// operation then it binds the compiled form to that location in the opcode tree. There is no reason for that to change just because you used a qr in the regex or not. If you read Dominus' remarks on that at Dirty Secrets of the Perl Regex Engine then that will be clear.
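The stringification step described above can be observed directly. This is only a minimal sketch: the variable names `$as_string` and `$combined` are made up for illustration, and the exact stringified text differs between perl versions (older perls print something like `(?-xism:.)`), so no output is shown for it.

```perl
use strict;
use warnings;

my $qr = qr/./;                  # a compiled regex object
my $as_string = "$qr";           # stringified form; exact text varies by perl version
my $combined  = qr/\A$qr$qr\z/;  # the two pieces are interpolated into a new pattern

print "piece:    $as_string\n";
print "combined: $combined\n";
print "two chars match\n" if 'ab'  =~ $combined;
print "three do not\n"    if 'abc' !~ $combined;
```

The combined pattern simply contains the text of each piece wrapped in its own group, which fits the claim that the larger regex is built from strings rather than from the already-compiled objects.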

      The answers to your questions (in order):

      1. I don't know
      2. no (you are penalized)
      3. no (you are penalized)
      4. the same thing it always does
      5. no (you are penalized)
      The penalizing is from having to do a magic_get on the qr ops instead of just reading it as a string and then the overall penalty of doing work more than once (compile the regex for qr, mg_get the stringified form, then compile the larger regex). Or at least that's how I read it. Please correct me if I'm wrong - I am still quite a novice at this.

          $qr = qr/./;

          'a' =~ /$qr$qr/;
          __DATA__
          C:\>perl -MO=Concise qr.pl
          e  <@> leave[t1] vKP/REFC ->(end)
          1     <0> enter ->2
          2     <;> nextstate(main 5 qr.pl:1) v ->3
          5     <2> sassign vKS/2 ->6
          3        </> qr(/./) s ->4
          -        <1> ex-rv2sv sKRM*/1 ->5
          4           <> gvsv s ->5
          6     <;> nextstate(main 5 qr.pl:3) v ->7
          d     </> match() vKS ->e
          7        <$> const(SPECIAL Null)[t5] s ->8
          c        <|> regcomp(other->d) sK/1 ->d
          8           <1> regcreset sK/1 ->9

          >> This is where you see the two [qr] expressions
          >> being fetched as global scalar values,
          >> concatenated and *then* just above this the
          >> regex is compiled.

          b              <2> concat[t4] sK/2 ->c
          -                 <1> ex-rv2sv sK/1 ->a
          9                    <> gvsv s ->a
          -                 <1> ex-rv2sv sK/1 ->b
          a                    <> gvsv s ->b

      I'm working off of three references: http://perl.plover.com/Rx/, perlop (the gory quoting part), and pp_hot.c, where pp_concat doesn't do anything special for qr magic. It's just strings at that point.

      __SIG__ use B; printf "You are here %08x\n", unpack "L!", unpack "P4", pack "L!", B::svref_2object(sub{})->OUTSIDE;

        Thank you diotalevi++. That is exactly the sort of answer I was looking for and it confirms my suspicions based on some fairly dodgy benchmarking.

        No matter how hard I tried to isolate the benefits of qr//'ing or /o'ing, those benefits always seemed to disappear whenever I attempted to combine one or more pre-compiled regexes with each other or with some non-compiled stuff. In fact, I sometimes detected a penalty from using pre-compiled regexes other than stand-alone, though the differences were too small to quantify with any accuracy.



      1 - Why does the qr// operator accept the /o modifier?
      Just for compatibility reasons, it doesn't actually have any effect on the resulting regex object.
      2 - If you combine one or more parts defined with qr// in a regex with some non-precompiled stuff, do you still get the advantage of precompilation?
      In your example, you get the advantage of compilation with the regex objects, but the match still has to be compiled dynamically since it contains variables (although using regex objects should be slightly faster than plain strings since they're already compiled).
      3 - If there was a non-compiled var reference in the above m//, do I still get any benefit from pre-compiling the other parts?
      Yup, as the regex is already compiled, whereas a plain string has to be compiled first before matching.
      4 - What happens if I add a /o modifier to the m// above?
      I believe the effect would be the same as if you were using plain strings, as it would compile to the same thing, so the result would be that the match regex is compiled only once, and once compiled it's always the same. So you'll get a tiny benefit with the pre-compiled regexes on first compilation, but in the long run it's pretty negligible.
      5 - If one or more of the pre-compiled parts (and/or the non-precompiled parts) contains capture brackets, was there any benefit (in performance terms) from pre-compiling some parts?
      There's no reason for capturing to affect the performance of a compiled regex vs a compile'n'do regex, as once compiled a regex will perform the same whether it was pre-compiled or otherwise.
      HTH

      _________
      broquaint

      1. Why does the qr// operator accept the /o modifier?

      So that you can create a 'static' compiled regex object that can be interpolated into more complex patterns subsequently in the program.

      2. If you combine one or more parts defined with qr// in a regex with some non-precompiled stuff, do you still get the advantage of precompilation?

      The discussion on p194 of Camel 3rd Ed. states that you can 'chain' qr// operators into one pattern to prevent re-compilation, so the answer would appear to be 'no'.

      3. If there was a non-compiled var reference in the above m//, do I still get any benefit from pre-compiling the other parts?

      No; again, according to the reference above, the pattern would be re-compiled.

      4. What happens if I add a /o modifier to the m// above?

      You'd get a once-only compilation of the pattern.

      5. If one or more of the pre-compiled parts (and/or the non-precompiled parts) contains capture brackets, was there any benefit (in performance terms) from pre-compiling some parts?

      I doubt that the presence or absence of capture brackets makes much difference to whether or not precompilation provides any benefit.

        Que?

        So patterns compiled with qr// are 'dynamic' unless I use the /o modifier? Could you explain your definition of 'static' in this context? Can you give me a reference to this information?

        2) & 3) - I think I would want considerably more factual information regarding what runtime steps are prevented from repetition by the use of qr// than I can derive from your brief quote, before I could draw any conclusions, never mind your definitive statement.

        4) So, did I benefit, in terms of runtime performance from pre-compiling some parts of the final pattern? Or am I in effect forcing the pre-compiled parts of the regex to be re-inspected? Would it actually be better to simply put all the parts together in a single regex with the /o modifier so that the compiler only needs to process everything one time?

        5) From what source do you derive that conclusion?

        It would make sense to me that if I use qr// or possibly /o (which I think amount to pretty much the same thing, but am open to correction), and the regex contains one or more sets of capture brackets, grouping brackets, repetition modifiers etc., it could be possible to pre-build a parsing tree (or some such) so that (for example) the size of the @+ and @- arrays could be pre-allocated and pointed to rather than needing to do this at runtime. However, if this was done for 2 separate patterns each containing a set of capture brackets, when they become combined together, that pre-allocation needs to change.

        Whilst there may be some benefit in combining two pre-parsed regexes together by using whatever data-structures are built internally to represent them, when these are further combined with non-precompiled parts, it might simply be quicker to have the regex engine build the internal data-structure to represent the entire pattern in a single pass rather than parsing the non-compiled parts, having to take into account the effects that the embedded pre-compiled parts have on (for example) capture bracket numbering.

        I would like to know, without needing to resort to source-diving, which of the two approaches is used, and which has the least impact at runtime?



Re: meaning of /o in regexes
by dakkar (Hermit) on Dec 06, 2002 at 12:48 UTC

    When you have a regexp with interpolation (as in your example), it normally gets recompiled each time it has to be executed, because the variables to be interpolated might have changed. This is slow, but is usually the right thing.

    If you are sure that the variables will not change (say, because they are set outside the loop in which the regexp is used), you can say /o to tell the compiler "don't worry, just compile this regexp the first time, the variables won't change".

    Now, if you do change the variables, you're cheating the compiler, breaking your promises, and so on. The compiler still trusts you, and will not recompile the regexp (hence the results of your test).
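A small variation on the original poster's loop makes the broken promise visible. This is a sketch only; the arrays `@plain` and `@once` are names invented here to collect which values matched.

```perl
use strict;
use warnings;

my (@plain, @once);
my $x = 1;
for my $value (1 .. 9) {
    push @plain, $value if $value =~ /$x/;   # sees the current $x every time
    push @once,  $value if $value =~ /$x/o;  # keeps the pattern built when $x was 1
    ++$x;
}
print "without /o: @plain\n";   # without /o: 1 2 3 4 5 6 7 8 9
print "with /o:    @once\n";    # with /o:    1
```

Only the value 1 contains the digit 1, so the /o version matches once and then keeps testing against a pattern that no longer reflects $x.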

    This was all before qr//. Now I think it is the best way to avoid recompiling regexps.

    So, in short, yes, it is the only difference.

    -- 
            dakkar - Mobilis in mobile
    
Re: Meaning of /o in Regexes
by cjf-II (Monk) on Dec 06, 2002 at 12:30 UTC

    The /o modifier is used when you only want the pattern compiled once. So in the second pattern, the later changes to $x are never seen and, as such, it only matches once.

    Update: Okay, I promise to read the questions in full before replying from now on (and this time I mean it! ;-). So to answer your actual question...

    The difference between qr// and the /o modifier as I understand it is that with /o you can never change the pattern, whereas with qr// you can compile the pattern to be used as part of a larger regex. As the good book shows:

        @regexes = ();
        for $pattern (@patterns) {
            push @regexes, qr/$pattern/;
        }
        for $item (@data) {
            for $re (@regexes) {
                if ($item =~ /$re/) {
                    print "Matches!\n";
                }
            }
        }

    I Hope that's slightly more helpful :).

      Thanks, but this part I understood; it is the comparison with qr that I am puzzled about.
      ---------------------------
      Dr. Mark Ceulemans
      Senior Consultant
      IT Masters, Belgium

        Thanks, but this part I understood; it is the comparison with qr that I am puzzled about.

        /o and qr// provide different services to the programmer:

        /o says 'Take this pattern and create a compiled regex that will not change for the life of the program'

        qr// says 'Take this pattern and return a special object that I can assign to a variable, interpolate into another pattern, pass to a subroutine, etc, etc, and as a side-effect creates a compiled form that I can use in pattern matches directly.'
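To illustrate the services listed for qr//, here is a rough sketch in which a regex object is stored in a variable, interpolated into a larger pattern, and passed to a subroutine. The helper `count_matches` and the sample data are hypothetical, chosen only for this example.

```perl
use strict;
use warnings;

# Hypothetical helper: counts how many items match a regex object.
sub count_matches {
    my ($re, @items) = @_;
    return scalar grep { $_ =~ $re } @items;
}

my $digits = qr/\d+/;                  # regex object held in a lexical
my $signed = qr/\A[+-]?$digits\z/;     # interpolated into a larger pattern

my @data = ('42', '-7', 'abc', 'x9y');
my $direct   = count_matches($digits, @data);   # any item containing digits
my $anchored = count_matches($signed, @data);   # whole item is a signed integer

print "contain digits:  $direct\n";    # contain digits:  3
print "signed integers: $anchored\n";  # signed integers: 2
```

None of this is possible with /o alone, since /o never gives you a value you can hold on to.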

(tye)Re: meaning of /o in regexes
by tye (Sage) on Dec 07, 2002 at 06:01 UTC

    First, I'll apologize. I haven't read every node in this thread in detail. But I think there are a few key points about /o that should be brought up that I didn't notice being mentioned.

    1) Don't use /o anymore. If you aren't into the subtle points, then just remember that. /o used to provide a rather significant speed improvement in certain rare situations. Several things have changed related to this. One is that now it provides only a minor speed improvement.

    What hasn't changed is that /o can easily lead to bugs in your code. It means that $y =~ /$x/ no longer matches $y against $x but instead matches $y against whatever was in $x the first time the code got executed. This can be quite confusing and it is easy to get this wrong.

    2) If you really want the minor speed improvement that /o could still give you, then you should use qr// instead. That is, replace:

        sub foo {
            $y =~ /$x/o;
        }

    with

        BEGIN {
            my $re;
            sub foo {
                $re = qr/$x/ unless $re;
                $y =~ $re;
            }
        }

    or, more likely, where you now set $x to be the one value you want to use, instead set $re= qr/$x/ in that place (sorry, the second example looks a bit complicated, but that is mostly to do with the contrived nature of the example -- in reality, using qr// in place of //o likely makes your code easier to understand).

    Using qr// is even better because you get the speed improvement of being able to use $re 10,000 times without recompiling and yet you can update your program such that you do $re= qr/$x/ a second time and then use $re 10,000 more times with only the one extra compilation. qr// is every bit as fast as //o, but it is much more flexible and much less error prone.
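The "compile once, reuse many times, recompile when you choose" pattern described above might look like the following sketch. The word list and the variables `@first` and `@second` are made up for illustration.

```perl
use strict;
use warnings;

my @words = qw(cat concatenate dog hotdog);

my $x  = 'cat';
my $re = qr/$x/;                          # compiled here, once
my @first = grep { $_ =~ $re } @words;    # reuses $re for every word

$x  = 'dog';
$re = qr/$x/;                             # one explicit recompilation
my @second = grep { $_ =~ $re } @words;

print "first:  @first\n";    # first:  cat concatenate
print "second: @second\n";   # second: dog hotdog
```

With /$x/o in place of $re, the second pass would still be matching against 'cat'.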

    3) qr//o is identical to qr//. In fact, japhy (I believe) ran off one day to patch Perl such that qr//o would no longer be supported (since supporting it just leads to confusion as you have now seen). I'm curious to see what happened to that patch, but unfortunately now is not the time for me to go digging for that.

    More rambling details...

    It used to be that saying /$x/ would cause the regex to be recompiled every single time it was executed. Now, /$x/ is only recompiled if $x has changed since the last time it was recompiled.

    qr// gets compiled every time the qr// part is executed, even if you say qr//o. If you really only want it to compile once, then you should only execute the qr// once. (qr//o really should be an error.) BTW, I tested this assertion in Perl 5.006.

            - tye

      And of course, if you run your regex in a Perl script under an Apache web server with mod_perl, a /o regex stays bound to whatever value the first user put into it, because your script is compiled only once for an untold number of requests.

      It can make for a lot of interesting bugs!
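The effect can be imitated without a web server. The sketch below does not use mod_perl at all; it just calls a hypothetical `handler` subroutine twice within one process, which stands in for two requests served by the same long-lived interpreter.

```perl
use strict;
use warnings;

# Hypothetical request handler: the /o pattern is compiled on the first call only.
sub handler {
    my ($user_pattern, $text) = @_;
    return $text =~ /$user_pattern/o ? 1 : 0;
}

my $first  = handler('foo', 'foobar');   # compiles /foo/ and matches
my $second = handler('bar', 'bar');      # still tests against /foo/

print "first:  $first\n";    # first:  1
print "second: $second\n";   # second: 0
```

The second "user" searches for 'bar' in 'bar' and is told there is no match, because the pattern from the first call is still in place.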

      CountZero

      "If you have four groups working on a compiler, you'll get a 4-pass compiler." - Conway's Law

Re: meaning of /o in regexes
by sauoq (Abbot) on Dec 07, 2002 at 02:26 UTC

    A touch of history might shed some light on the matter too. The /o modifier existed quite a while before we were blessed with the ability to assign those compiled regular expressions to variables via qr// which, if I recall correctly, wasn't available in a stable release until 5.005.

    -sauoq
    "My two cents aren't worth a dime.";