
Match variables are one of those things that sometimes cause much head-scratching amongst people writing perl code. Are they globals? Are they lexicals? What resets them, and when do they go away? What the heck is really happening with these darned things, and why do odd values show up sometimes?

Match variables are, safe to say, somewhat perplexing.

First, there's some history to understand. Regex match variables are an old feature of perl, predating lexicals by quite a while. Lexical variables only came into perl with perl 5, while match variables have been in perl since at least version 3, and may date back to version 2 or 1. (That predates me by a lot, so I'm not sure.)

Match variables behave sort of like lexical variables, but not entirely like lexicals. That means code like:

      "foo" =~ m/(oo)/;
      {
        "bar" =~ /(ar)/;
      }
      print $1, "\n";

prints oo. That makes it look like match variables are lexically scoped--the inner match didn't affect the outer scope's match variables. But what happens when you try:

$search_string = "123abc";

sub foo {
    $again = shift;
    $search_string =~ /(\d+)(\w+)/;
    print "$1 $2\n";
    return unless $again;
    $search_string = "456def";
    foo(0);
    print "$1 $2\n";
}

foo(1);

You'd think it prints

123 abc
456 def
123 abc

but it doesn't. Instead, it prints

123 abc
456 def
456 def

Why? Because the match variables aren't really lexical. Instead, they're tied to perl's optree at compile time by the compiler. The compiler tries to make them lexical, but there's a limit to what it can do because it's a compile-time thing (naturally) while lexicals have a runtime component. To understand what's going on, you have to understand some of how perl compiles your program before running it, and how match variables mix with that.

When perl compiles your program, it builds up a big tree structure, called an optree, filled with nodes, called op nodes or opcodes. Each node in the tree represents an action that perl must take. Nodes in the tree can have a variety of things hanging off of them, including the next node to take, the different nodes to take for conditional tests (the true and false nodes), source code information, and regex match variables.

That's right, the regex match variables are attached to the optree.

Lexical variables, on the other hand, live in a scratchpad, one pad per sub, and every time a sub is called a new pad is (potentially) allocated. That way recursive subs work out--each time you enter the sub recursively a new pad is allocated. If that didn't happen, each recursive invocation would reuse the same pad and stomp on variables, which would be bad. Because lexicals live in a scratchpad, which is separate from the code, multiple overlapping invocations of a sub don't have their lexical variables collide.
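To make the per-invocation pad idea concrete, here's a minimal sketch. The sub name depth_log and its variables are made up for illustration. Each recursive call gets its own $label, so the outer calls still see their own values once the inner calls return:

```perl
use strict;
use warnings;

# Hypothetical helper (not from the original post): logs its own
# lexical before and after recursing.
sub depth_log {
    my ($depth, $log) = @_;
    my $label = "level $depth";      # lives in this invocation's pad
    push @$log, "enter $label";
    depth_log($depth + 1, $log) if $depth < 2;
    push @$log, "leave $label";      # still this invocation's own value
    return $log;
}

my $log = depth_log(0, []);
print "$_\n" for @$log;
```

The "leave" lines come out in reverse order of the "enter" lines, each with the label its own invocation set, because no invocation's $label was touched by the calls nested inside it.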

Since the regex match variables live in the optree, rather than in a scratchpad, there's only one copy per node. Many nodes may share the same match variables, of course--the match node and the print node in a statement like:

   "foo" =~ /(oo)/;
   print $1;

reference the same match variables.

Perl's compiler is pretty clever, and simulates lexicalness in most cases. If your code is like our original example,

      "foo" =~ m/(oo)/;
      {
        "bar" =~ /(ar)/;
      }
      print $1, "\n";

it works right. That's because the nodes that represent the code inside the block reference different match variables than code outside the block. That's lexical scoping, but it's compile-time lexical scoping. Where that bites us is in our second example:

$search_string = "123abc";

sub foo {
    $again = shift;
    $search_string =~ /(\d+)(\w+)/;
    print "$1 $2\n";
    return unless $again;
    $search_string = "456def";
    foo(0);
    print "$1 $2\n";
}

foo(1);

The compiler looks at this code and sees one match, inside one block, the block for foo. It then generates one set of match variables and attaches it to all the nodes in the tree for the sub.

When we execute this code, foo is called. The first match takes place, and the match variables attached to the match node are filled in. Then we print them. That part's fine.

Next, the search string is reassigned, and we call the sub recursively (with a parameter to keep the recursion from going on forever). The match happens again, and the match variables attached to the match node are set to the new match results. The variables are printed, then the recursive invocation exits.

Then in the top level invocation we print the match variables again. And, interestingly, we get the values from the recursive call. Why? Well, remember we said the match variables were attached to nodes of the optree, the compiled version of your code. There's only one optree for the foo subroutine, no matter how many times we invoke it recursively. That means that no matter how many times we invoke it, we always are referencing the same variables, potentially stomping on previous values unknowingly.

This is also why returning references to match variables for later storage is an exercise in pain. The reference you return is, of course, a reference to this shared match variable, so each and every time you execute the code the variable came from you'll be overwriting it with a new value. (This doesn't affect returning the actual value, rather than a reference, since perl will make a copy just like it does for any other plain scalar.)
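Since copies are safe, one way to sidestep the stomping in the recursive example is to copy the captures into lexicals right at the match. What follows is a sketch of that idea, not the only fix. The names $digits, $rest and @seen are mine, added for illustration, and the results are collected in @seen so they're easy to inspect:

```perl
use strict;
use warnings;

my $search_string = "123abc";
my @seen;    # illustration only: records what each print would have shown

sub foo {
    my $again = shift;
    # A match in list context returns its captures, so they can be
    # copied straight into this invocation's own lexicals.
    my ($digits, $rest) = $search_string =~ /(\d+)(\w+)/
        or return;
    push @seen, "$digits $rest";
    return unless $again;
    $search_string = "456def";
    foo(0);
    push @seen, "$digits $rest";    # our own copies, untouched by foo(0)
}

foo(1);
print "$_\n" for @seen;
```

Because $digits and $rest live in the scratchpad rather than with the match, the last line printed is 123 abc, which is what the original code looked like it should print.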

This is also why match variables in closures can behave somewhat unusually. Multiple instances of the closure all share the same optree, since there's really only one optree for all the closure instances, which differ only in the scratchpad that gets passed into the anonymous subroutine. Perl sometimes runs some code to initialize the variables, but it's possible to see old data left from previous invocations of different versions of the same closure.
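The same copy-it-first habit helps with closures. In this sketch the factory name make_reporter and its variables are hypothetical. The capture is copied into a lexical before the closure is built, so each closure instance closes over its own value instead of relying on a match variable:

```perl
use strict;
use warnings;

# Hypothetical factory (not from the original post): returns a closure
# that reports the first word of the text it was built with.
sub make_reporter {
    my ($text) = @_;
    my ($word) = $text =~ /(\w+)/;    # copy the capture into this call's pad
    return sub { return defined $word ? $word : '' };
}

my $first  = make_reporter("alpha beta");
my $second = make_reporter("gamma delta");
print $first->(), " ", $second->(), "\n";    # prints "alpha gamma"
```

Each call to make_reporter gets a fresh $word in a fresh pad, so building the second closure doesn't disturb what the first one reports.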

In reply to Zen and the Art of Match Variables by Elian
