http://qs321.pair.com?node_id=637490

TedYoung has asked for the wisdom of the Perl Monks concerning the following question:

Anyone who does anything significant with regular expressions learns that the variables $&, $' and $` impose a significant performance penalty on all regular expression matches. I have personally witnessed just how much these variables can hurt performance (KinoSearch & Large Documents).

perlvar says under @- the following:

$` is the same as substr($var, 0, $-[0])
$& is the same as substr($var, $-[0], $+[0] - $-[0])
$' is the same as substr($var, $+[0])

So my question is: why can't these variables be implemented this way behind the scenes? If I could get access to the last variable successfully matched against, for instance, I could write a Tie interface for these variables very easily.

I am guessing it has to do with keeping track of the last variable matched and handling cases where that variable is changed before reading $& (such as assignment or a s///).

This question is really out of curiosity about the internals of the perl regexp engine.

Thanks,

Ted Young

($$<<$$=>$$<=>$$<=$$>>$$) always returns 1. :-)

Replies are listed 'Best First'.
Re: Can we make $& better?
by ikegami (Patriarch) on Sep 06, 2007 at 18:20 UTC

    There is a difference:

    $\ = "\n";
    my $s = 'abcdefghij';
    $s =~ /def/;
    print $&;                                # def
    print substr($s, $-[0], $+[0] - $-[0]);  # def
    $s = '1234567890';
    print $&;                                # def
    print substr($s, $-[0], $+[0] - $-[0]);  # 456

    Does it matter? Maybe. Switching could break existing code.

      The difference you point out is even more blatant in this example:
      $\ = "\n";
      my $s = 'abcdefghij';
      $s =~ s/def/123/;
      print $&;                                # def
      print substr($s, $-[0], $+[0] - $-[0]);  # 123
      Some code is surely using that.

      The problem with $& and friends isn't the location of the partial strings but the semantics that require copying the parts. Thus for every match, the entire string must be copied.

      Anecdote:

      A poster on clpmisc complained that his program "stopped working" after he introduced use English. He was matching a short pattern against some monster genome string (one or two GB) with a reasonable number of matches (some 10_000). Copying killed the cat, to coin a phrase.
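      For readers hitting the same problem, here is a minimal sketch of the usual workaround: loading English with the -no_match_vars import, which leaves out $PREMATCH, $MATCH and $POSTMATCH (the aliases for $`, $& and $'). The short $genome string is an invented stand-in for the poster's data, not his actual code.

```perl
use strict;
use warnings;
use English qw(-no_match_vars);   # skip the $PREMATCH/$MATCH/$POSTMATCH aliases

# Invented sample data standing in for a very large genome string.
my $genome = 'ACGTACGTTAGC';

my @hits;
while ($genome =~ /TAG/g) {
    # @LAST_MATCH_START is the English name for @-; it is still exported.
    push @hits, $LAST_MATCH_START[0];
}

print "@hits\n";
```

      Only offsets are recorded here, so nothing in the loop needs a copy of the matched string.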

      Anno
Re: Can we make $& better? (no need with $^MATCH)
by grinder (Bishop) on Sep 06, 2007 at 19:45 UTC

    In Perl 5.10, ${^MATCH} will be equivalent to $& but without triggering the global speed penalty. This means $& can be left alone.

    The price to pay is that you will need to add the /p modifier to the pattern to tell perl that you want ${^PREMATCH}, ${^MATCH} and ${^POSTMATCH} populated. A reasonable trade-off.
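    A minimal sketch of what that looks like in practice, assuming Perl 5.10 or later. The sample string and variable names are invented for illustration.

```perl
use strict;
use warnings;

my $text = 'abcdefghij';
my ($pre, $match, $post);

# /p asks perl to preserve the match context for this pattern only.
if ($text =~ /def/p) {
    ($pre, $match, $post) = (${^PREMATCH}, ${^MATCH}, ${^POSTMATCH});
}

print "$pre|$match|$post\n";
```

    The three variables correspond to $`, $& and $' respectively.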

    • another intruder with the mooring in the heart of the Perl

      Um, that gets rid of the global performance penalty but (as near as I can tell) keeps the local performance penalty, so I don't agree with "no need".

      If you are writing a parser that is spending most of its time applying regexes against potentially large strings, then the local penalty is plenty big of a penalty. It would be very nice to be able to use $& and $1 etc. w/o the local penalty with the caveat that those variables no longer work correctly if you modify the string most recently matched against (no, I'm not talking about being able to use $1 in s/// w/o the local 'copy' penalty).

      I vaguely proposed a /k option to prevent copying (and an option to prevent capturing, a somewhat related problem). I don't recall seeing such in the announcement and didn't find the announcement to check again.

      I don't even mind if $1 etc. can't be used directly. I'd just like to be able to disable the copying of the entire string while still being able to pull out the parts based on @- and @+ (a simple module could make these nearly as easy to use as $1 etc.).
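      A rough sketch of what such a helper might look like. The name group_at is hypothetical and not taken from any existing module; it simply slices the original string using saved copies of @- and @+, so it only gives the right answer while the string is unchanged since the match.

```perl
use strict;
use warnings;

# Hypothetical helper: return capture group $n by slicing the original
# string with saved start (@-) and end (@+) offsets.
sub group_at {
    my ($str_ref, $starts, $ends, $n) = @_;
    return unless defined $starts->[$n] && defined $ends->[$n];
    return substr($$str_ref, $starts->[$n], $ends->[$n] - $starts->[$n]);
}

my $text = 'key=value';
my (@starts, @ends);
if ($text =~ /(\w+)=(\w+)/) {
    @starts = @-;   # start offsets of the whole match, group 1, group 2
    @ends   = @+;   # end offsets of the whole match, group 1, group 2
}

print group_at(\$text, \@starts, \@ends, 1), "\n";   # the text group 1 matched
print group_at(\$text, \@starts, \@ends, 2), "\n";   # the text group 2 matched
```

      Passing a reference to the string keeps the helper itself from copying it.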

      - tye        

        You still could allow changing the string to match against, and be able to read the contents of $1 and friends later, if a copy-on-write system was implemented. That way these match strings would only get copied in case your original string changes underneath.

        I'd prefer it if Perl was smart enough to skip the copy in case you don't need it, but I doubt it could be made so smart without a little manual help from a flag.

        The solution to the copy problem is to use m//g in scalar context which will NOT copy the string, and WILL result in the special match vars returning incorrect results if the string is changed after the fact.
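        For illustration, a minimal sketch of the m//g-in-scalar-context idiom combined with @- and @+. It only demonstrates the loop shape; whether the copy is skipped depends on the perl version, and that claim is the one made above, not something this example measures. The sample string is invented.

```perl
use strict;
use warnings;

my $text = 'a1b22c333';
my @spans;   # [offset, length] pairs, one per match

# In scalar context, /g finds one match per evaluation and resumes
# from pos($text) on the next pass through the loop.
while ($text =~ /\d+/g) {
    push @spans, [ $-[0], $+[0] - $-[0] ];
}

# Extract the matched text afterwards; valid only while $text is unchanged.
my @numbers = map { substr($text, $_->[0], $_->[1]) } @spans;
print "@numbers\n";
```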

        And yes, we did discuss some other options to control this behaviour, it's just that I never got around to dealing with them and now it's too late for 5.10. Sorry about that.

        BTW, due to a misconception on my part I was extremely reluctant to add new modifiers. I'm now much less reluctant, as I have resolved the misconception: I believed that adding new modifiers would break loads and loads of stuff, but further analysis showed this concern was unfounded.

        ---
        $world=~s/war/peace/g

Re: Can we make $& better?
by ferreira (Chaplain) on Sep 06, 2007 at 19:59 UTC

      Actually this post is the first time I've even heard of Regexp::MatchContext, so no, it wasn't in any way the inspiration for ${^MATCH} and friends.

      And actually it's a total fluke that both use 'p' as the new flag. Originally it was 'k' in the core implementation, which was changed to 'p' (for preserve) later on because another TheDamian module uses 'k' for special purposes.

      ---
      $world=~s/war/peace/g