perlmeditation
Elian
Match variables are one of those things that sometimes cause much
head-scratching amongst people writing perl code. Are they globals?
Are they lexicals? What resets them, and when do they go away? What
the heck is really happening with these darned things, and why do odd
values show up sometimes?
<P>
Match variables are, safe to say, somewhat perplexing.
<readmore>
<P>
First, there's some history to understand. Regex match variables are
an old feature of perl, predating lexicals by quite a while. Lexical
variables only came into perl with perl 5, while match variables have
been in perl since at least version 3, and may date back to version 2
or 1. (That predates me by a lot, so I'm not sure.)
<P>
Match variables behave sort of like lexical variables, but not
entirely like lexicals. That means code like:
<code>
"foo" =~ m/(oo)/;
{
"bar" =~ /(ar)/;
}
print $1, "\n";
</code>
prints <tt>oo</tt>. That makes it look like match variables are
lexically scoped--the inner match didn't affect the outer scope's
match variables. But what happens when you try:
<code>
$search_string = "123abc";
sub foo {
$again = shift;
$search_string =~ /(\d+)(\w+)/;
print "$1 $2\n";
return unless $again;
$search_string = "456def";
foo(0);
print "$1 $2\n";
}
foo(1);
</code>
You'd think it prints
<code>
123 abc
456 def
123 abc
</code>
but it doesn't. Instead, it prints
<code>
123 abc
456 def
456 def
</code>
Why? Because the match variables aren't really lexical. Instead,
perl's compiler ties them to the optree at compile time. The
compiler tries to make them behave lexically, but there's a limit to
what it can do because it's a compile-time thing (naturally) while
lexicals have a runtime component. To understand what's going on, you
have to understand some of how perl compiles your program before
running it, and how match variables mix with that.
<P>
When perl compiles your program, it builds up a big tree structure,
called an optree, filled with nodes, called op nodes or opcodes. Each
node in the tree represents an action that perl must take. Nodes in
the tree can have a variety of things hanging off of them, including
the next node to take, the different nodes to take for conditional
tests (the true and false nodes), source code information, and regex
match variables.
<P>
That's right, the regex match variables are attached to the optree.
<P>
Lexical variables, on the other hand, live in a scratchpad, one pad
per sub, and every time a sub is called a new pad is (potentially)
allocated. That way recursive subs work out--each time you enter the
sub recursively a new pad is allocated. If that didn't happen, each
recursive invocation would reuse the same pad and stomp on variables,
which would be bad. Because lexicals live in a scratchpad, which is
separate from the code, multiple overlapping invocations of a sub
don't have their lexical variables collide.
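<P>
As a small illustration of that per-invocation pad, here is a minimal
sketch. The names <tt>countdown</tt> and <tt>@seen</tt> are made up for
this example. Each recursive call gets its own <tt>$n</tt>, so the value
is still intact when the inner calls return:
<code>
use strict;
use warnings;

my @seen;

sub countdown {
    my $n = shift;                 # lexical: lives in this invocation's pad
    countdown($n - 1) if $n > 0;   # inner calls get their own $n
    push @seen, $n;                # still the value this call started with
    return;
}

countdown(2);
print "@seen\n";                   # prints "0 1 2"
</code>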
<P>
Since the regex match variables live in the optree, rather than in a
scratchpad, there's only one copy per node. Many nodes may share the
same match variables, of course--the match node and the print node in
a statement like:
<code>
"foo" =~ /(oo)/;
print $1;
</code>
reference the same match variables.
<P>
Perl's compiler is pretty clever, and simulates lexicalness in most
cases. If your code is like our original example,
<code>
"foo" =~ m/(oo)/;
{
"bar" =~ /(ar)/;
}
print $1, "\n";
</code>
it works right. That's because the nodes that represent the code
inside the block reference different match variables than code outside
the block. That's lexical scoping, but it's <i>compile-time</i>
lexical scoping. Where that bites us is in our second example:
<code>
$search_string = "123abc";
sub foo {
$again = shift;
$search_string =~ /(\d+)(\w+)/;
print "$1 $2\n";
return unless $again;
$search_string = "456def";
foo(0);
print "$1 $2\n";
}
foo(1);
</code>
The compiler looks at this code and sees one match, inside one block,
the block for foo. It then generates one set of match variables and
attaches it to all the nodes in the tree for the sub.
<P>
When we execute this code, foo is called. The first match takes place,
and the match variables attached to the match node are filled in. Then
we print them. That part's fine.
<P>
Next, the search string is redefined, and we call the sub
recursively (with a parameter to keep the recursion from going on
forever). The match happens again, and the match variables attached to
the match node are set to the new match results. The variables are
printed, then the recursive invocation exits.
<P>
Then in the top level invocation we print the match variables
again. And, interestingly, we get the values from the recursive
call. Why? Well, remember we said the match variables were attached to
nodes of the optree, the compiled version of your code. There's only
one optree for the foo subroutine, no matter how many times we invoke
it recursively. That means that no matter how many times we invoke it,
we always are referencing the same variables, potentially stomping on
previous values unknowingly.
<P>
This is also why returning references to match variables for later
storage is an exercise in pain. The reference you return is, of
course, a reference to this shared match variable, so each and every
time you execute the code the variable came from you'll be overwriting
it with a new value. (This doesn't affect returning the actual value,
rather than a reference, since perl will make a copy just like it does
for any other plain scalar.)
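<P>
Copying is also one way around the recursion problem from earlier. The
following is a minimal sketch, not the only fix. It assumes you are free
to add <tt>my</tt> declarations, and the names <tt>$digits</tt>,
<tt>$rest</tt>, and <tt>@lines</tt> are invented for the example. A
match in list context returns its captures, so they can be copied
straight into lexicals that live in the sub's scratchpad:
<code>
use strict;
use warnings;

my $search_string = "123abc";
my @lines;

sub foo {
    my $again = shift;
    # List-context match returns the captures; copy them into lexicals
    # so each invocation keeps its own values.
    my ($digits, $rest) = $search_string =~ /(\d+)(\w+)/;
    push @lines, "$digits $rest";
    return unless $again;
    $search_string = "456def";
    foo(0);
    push @lines, "$digits $rest";   # this invocation's copies, untouched
    return;
}

foo(1);
print "$_\n" for @lines;
</code>
Because the copies are ordinary lexicals, the outer invocation still
sees <tt>123 abc</tt> after the recursive call returns, so this prints
<tt>123 abc</tt>, <tt>456 def</tt>, <tt>123 abc</tt>.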
<P>
The same sharing explains why match variables in closures can behave
somewhat unusually. There's really only one optree for all the
instances of a closure. The instances differ only in the scratchpad
that gets passed into the anonymous subroutine. Perl sometimes runs
code to initialize the variables, but it's possible to see old data
left over from previous invocations of different instances of the same
closure.