http://qs321.pair.com?node_id=245725

Match variables are one of those things that sometimes cause much head-scratching amongst people writing perl code. Are they globals? Are they lexicals? What resets them, and when do they go away? What the heck is really happening with these darned things, and why do odd values show up sometimes?

Match variables are, safe to say, somewhat perplexing.

First, there's some history to understand. Regex match variables are an old feature of perl, predating lexicals by quite a while. Lexical variables only came into perl with perl 5, while match variables have been in perl since at least version 3, and may date back to version 2 or 1. (That predates me by a lot, so I'm not sure.)

Match variables behave sort of like lexical variables, but not entirely like lexicals. That means code like:

"foo" =~ m/(oo)/;
{
    "bar" =~ /(ar)/;
}
print $1, "\n";
prints oo. That makes it look like match variables are lexically scoped--the inner match didn't affect the outer scope's match variables. But what happens when you try:
$search_string = "123abc";
sub foo {
    $again = shift;
    $search_string =~ /(\d+)(\w+)/;
    print "$1 $2\n";
    return unless $again;
    $search_string = "456def";
    foo(0);
    print "$1 $2\n";
}
foo(1);
You'd think it prints
123 abc
456 def
123 abc
but it doesn't. Instead, it prints
123 abc
456 def
456 def
Why? Because the match variables aren't really lexical. What they are is tied to perl's optree at compile time by perl's compiler. The compiler tries to make them lexical, but there's a limit to what it can do because it's a compile-time thing (naturally) while lexicals have a runtime component. To understand what's going on, you have to understand some of how perl compiles your program before running it, and how match variables mix with that.

When perl compiles your program, it builds up a big tree structure, called an optree, filled with nodes, called op nodes or opcodes. Each node in the tree represents an action that perl must take. Nodes in the tree can have a variety of things hanging off of them, including the next node to take, the different nodes to take for conditional tests (the true and false nodes), source code information, and regex match variables.

That's right, the regex match variables are attached to the optree.

Lexical variables, on the other hand, live in a scratchpad, one pad per sub, and every time a sub is called a new pad is (potentially) allocated. That way recursive subs work out--each time you enter the sub recursively a new pad is allocated. If that didn't happen, each recursive invocation would reuse the same pad and stomp on variables, which would be bad. Because lexicals live in a scratchpad, which is separate from the code, multiple overlapping invocations of a sub don't have their lexical variables collide.
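To see the per-call scratchpad at work, here is a minimal sketch. The sub name `remember` and its arguments are made up for illustration. Each invocation keeps its own copy of the `my` variable, so the nested call can't disturb the outer one:

```perl
use strict;
use warnings;

# Hypothetical demo sub: each call records its own lexical before and
# after an optional recursive call.
sub remember {
    my ($value, $again) = @_;
    my $mine = $value;                    # lives in this call's own pad
    my @log  = ("before: $mine");
    push @log, remember("inner", 0) if $again;
    push @log, "after: $mine";            # unaffected by the nested call
    return @log;
}

print "$_\n" for remember("outer", 1);
```

The last line logged is "after: outer", since the outer call's `$mine` was never touched by the inner call.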

Since the regex match variables live in the optree, rather than in a scratchpad, there's only one copy per node. Many nodes may share the same match variables, of course--the match node and the print node in a statement like:

"foo" =~ /(oo)/; print $1;
reference the same match variables.

Perl's compiler is pretty clever, and simulates lexicalness in most cases. If your code is like our original example,

"foo" =~ m/(oo)/;
{
    "bar" =~ /(ar)/;
}
print $1, "\n";
it works right. That's because the nodes that represent the code inside the block reference different match variables than code outside the block. That's lexical scoping, but it's compile-time lexical scoping. Where that bites us is in our second example:
$search_string = "123abc";
sub foo {
    $again = shift;
    $search_string =~ /(\d+)(\w+)/;
    print "$1 $2\n";
    return unless $again;
    $search_string = "456def";
    foo(0);
    print "$1 $2\n";
}
foo(1);
The compiler looks at this code and sees one match, inside one block, the block for foo. It then generates one set of match variables and attaches it to all the nodes in the tree for the sub.

When we execute this code, foo is called. The first match takes place, and the match variables attached to the match node are filled in. Then we print them. That part's fine.

Next, the search string's redefined, and we call the sub recursively (with a parameter to keep the recursion from going on forever). The match happens again, and the match variables attached to the match node are set to the new match results. The variables are printed, then the recursive invocation exits.

Then in the top level invocation we print the match variables again. And, interestingly, we get the values from the recursive call. Why? Well, remember we said the match variables were attached to nodes of the optree, the compiled version of your code. There's only one optree for the foo subroutine, no matter how many times we invoke it recursively. That means that no matter how many times we invoke it, we always are referencing the same variables, potentially stomping on previous values unknowingly.

This is also why returning references to match variables for later storage is an exercise in pain. The reference you return is, of course, a reference to this shared match variable, so each and every time you execute the code the variable came from you'll be overwriting it with a new value. (This doesn't affect returning the actual value rather than a reference, since perl will make a copy just like it does for any other plain scalar.)
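The difference between keeping a copy and keeping a reference can be sketched as below. The sub name `copy_versus_ref` is hypothetical, and both matches sit in the same scope to keep things simple. The copy holds on to the first capture, while the reference reports whatever the most recent match captured:

```perl
use strict;
use warnings;

# Hypothetical demo sub: compare a copied capture with a reference to $1.
sub copy_versus_ref {
    "foo" =~ /(oo)/ or die "first match failed";
    my $copy = $1;       # plain copy of the captured text
    my $ref  = \$1;      # reference to the match variable itself

    "bar" =~ /(ar)/ or die "second match failed";
    my $via_ref = $$ref; # read through the reference after the second match

    return ($copy, $via_ref);
}

my ($copy, $via_ref) = copy_versus_ref();
print "copy=$copy via_ref=$via_ref\n";   # copy=oo via_ref=ar
```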

This is also why match variables in closures can behave somewhat unusually. Multiple instances of the closure all share the same optree, since there's really only one optree for all the closure instances, which differ only in the scratchpad that gets passed into the anonymous subroutine. Perl sometimes runs some code to initialize the variables, but it's possible to see old data left over from previous invocations of different instances of the same closure.
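For the recursive example earlier, one way to sidestep the clobbering is to copy the captures into lexicals right after the match, since lexicals live in the per-call pad. This is a sketch rather than a drop-in replacement: `foo_copied` is a made-up variant that takes its strings as arguments and returns the lines instead of printing `$1` and `$2` directly.

```perl
use strict;
use warnings;

# Hypothetical variant of foo(): match the first string, recurse on the
# rest, and report this call's captures before and after the recursion.
sub foo_copied {
    my ($string, @rest) = @_;

    # A match in list context returns the captures, so they go straight
    # into lexicals that belong to this invocation only.
    my ($digits, $word) = $string =~ /(\d+)(\w+)/
        or return;

    my @lines = ("$digits $word");
    if (@rest) {
        push @lines, foo_copied(@rest);
        push @lines, "$digits $word";   # this call's own copies, not $1/$2
    }
    return @lines;
}

print "$_\n" for foo_copied("123abc", "456def");
```

With the captures held in lexicals, the third line comes out as "123 abc", which is what the original code looked like it should print.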

Replies are listed 'Best First'.
Re: Zen and the Art of Match Variables (copy?)
by tye (Sage) on Mar 25, 2003 at 22:00 UTC

    Thanks. Very informative.

    Perhaps I missed it or perhaps I misunderstand how things work, but isn't there a copy operation that happens at run time that you didn't cover? I'm thinking of a copy from the outer scope's match variables to the inner scope's match variables (when the inner scope is entered?), but I'm basing this on what I've heard others say, not on anything I've investigated myself.

    And it seems \$1 gives you a reference to a magic variable that looks up the value using the match variables of the opcode where you dereference it. That would be one way to explain why this code:

    $_= "foobar";
    /(oo)/;
    my $out= \$1;
    my $in;
    {
        /(ar)/;
        $in= \$1;
    }
    print "in=out\n" if $in == $out;
    prints "in=out". So I'd think the problem with returning a reference to a match variable would be that you'd get the outer scope's matches rather than that the value would be overwritten by subsequent calls to the subroutine.

                    - tye
      The inner scope doesn't get a copy of the match variables; rather, the optree for the inner scope just has a reference to the outer scope's match variables--the opnode pointers point to the same memory. (Or they did the last time I looked, though some of this code was touched to fix crashes in Windows forking perl and iThreads for 5.8.0.)

      You're right about the references--\$1 takes a reference to the magic match variable, which always refers to the active first match, as it works its magic by peering at the current optree pointers. I'll go patch that bit up.

        Let me try again.

        In your root node you show how leaving a scope appears to restore the old values of the match variables and explain that this actually happens by virtue of the outer scope's opnodes pointing to a different set of match variables from the inner scope's.

        But this doesn't explain how an inner scope manages to see the same values for the match variables as the outer scope when they haven't been overwritten. For example:

        for( qw( foobar foobaz ) ) {
            /(oo)/;
            {
                /(ar)/;
                print "($1) ";
            }
            print "[$1]\n";
        }
        outputs:
        (ar) [oo]
        (oo) [oo]
        and your root node explains why "[oo]" is output both times but doesn't explain how "(oo)" could be output.

        That is, how did the inner scope manage to get $1 a value of "oo" when that was set in the outer scope's match variable? I recall hearing that this is done by copying from one set of match variables to the other.

                        - tye
Re: Zen and the Art of Match Variables
by kelan (Deacon) on Mar 25, 2003 at 18:46 UTC

    Just curious, will Parrot have to implement the match variables this way, too? Or will it be able to do a better job at maintaining the Principle of Least Surprise? I don't know how much of Grammars has been implemented so far, so maybe you can't even answer this yet. Just wondering :)

    Update: Changed "Non-" to "Least" in the principle name. Makes better sense.

    kelan


    Perl6 Grammar Student

      Perl 6 and parrot regexes are much more self-contained, so this sort of thing won't happen. (I'm torn as to whether to make it happen for the perl 5 compatibility code, but I'm thinking not right now) The only real reason it works like this in perl now is because regexes predate lexicals--if we were doing it over again for perl 5 it wouldn't happen the way it does now.
Re: Zen and the Art of Match Variables
by herveus (Prior) on Mar 25, 2003 at 18:14 UTC
    Howdy!

    That was most informative! Thanks.

    If I am using the match variables, I make a particular point of capturing the values very promptly. Certainly before there is any opportunity for another regex to stomp the values. I work under the assumption that the values are very perishable; if I want to keep them, I need to put them in a safe place promptly.

    yours,
    Michael

Re: Zen and the Art of Match Variables
by Zaxo (Archbishop) on Mar 25, 2003 at 19:53 UTC

    ++Elian, well done! I'd like to nominate this for Tutorials.

    After Compline,
    Zaxo

      If you like, sure, but I'm not sure it's all that tutorialish--it's not like it's telling you how to do things, more explaining how things work. I'm easy, though, so whatever... :)