http://qs321.pair.com?node_id=322697

Not_a_Number has asked for the wisdom of the Perl Monks concerning the following question:

I was experimenting with parsing a flat file like in __DATA__ below, and I tried out the following (NB the snippets below work with strict and warnings, removed here for brevity):

my %totals;
while ( <DATA> ) {
    chomp;
    if ( /(^[a-z].*$)/i ) {
        $totals{$1} = 0 unless $totals{$1};
    }
    else {
        $totals{$1} += $_;
    }
}
print "$_: $totals{$_}\n" for keys %totals;
__DATA__
player1
11
22
11
player2
10
21
player1
22

This outputs, as expected(?):

player1: 66
player2: 31

Similarly, the following works:

while ( <DATA> ) {
    if ( /^([A-Z]+)$/ ) {
        print "In if: $1\n";
    }
    else {
        print "In else: $1\n";
    }
}
__DATA__
FOO
1234
xyz
BAR
dfgdfg

Output:

In if: FOO
In else: FOO
In else: FOO
In if: BAR
In else: BAR

But if I change my data to:

__DATA__
FOO
1234
Xyz
BAR
DFGdfg

(Note capitalisation changes in lines 3 and 5), the output is:

In if: FOO
In else: FOO
In else: F
In if: BAR
In else: B

...whereas I would have expected the result to be the same.

From perldoc perlre:

The numbered variables ($1, $2, $3, etc.) <snip> are all dynamically scoped until the end of the enclosing block or until the next successful match, whichever comes first.
But $1 can't have gone out of scope (or could it?) or the original snippet wouldn't have worked; and there can't have been a 'next successful match' (or could there?) since the code fell through to the else clause??

Additionally, if I change (for example) my last data item to a string that begins with something other than a capital letter but then contains at least one capital (e.g. 'dFgdfg' or '2FGdfg'), I get an 'uninitialized value' warning, because $1 is undefined.

Since I get no warnings for 'Xyz' and 'DFGdfg', why does $1 apparently just contain the first capital letter of the previous match in these cases?

Or am I missing something obvious?

(Tested on Win XP/AS 5.61 and Mandrake Linux/5.81)

TIA

Yours,

Confused of Paris

Edited by Chady -- added readmore tags.

Re: What's happening to my $1?
by blokhead (Monsignor) on Jan 20, 2004 at 20:49 UTC
    You are trying to use $1 after a failing match, so basically you deserve whatever's coming to you ;) The contents of the ${\d+} variables are only well-defined if the last (capturing) match was successful. The fact that you were getting the old value was lucky. You should just save $1 after a successful match if you will need it later:
    my $last_match;
    while ( <DATA> ) {
        if ( /^([A-Z]+)$/ ) {
            print "In if: $1\n";
            $last_match = $1;
        }
        else {
            print "In else: $last_match\n";
        }
    }
    __DATA__
    FOO
    1234
    Xyz
    BAR
    DFgdfg
    This produces the output you want.
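The same save-it-yourself approach carries over to the totals snippet in the original question. Here is a minimal sketch, assuming the data arrives as a plain list of lines; `total_by_player` and `$current` are names made up for this illustration, not anything from the original code:

```perl
use strict;
use warnings;

# Hypothetical helper: sums the numeric lines that follow each name line.
sub total_by_player {
    my @lines = @_;
    my ( %totals, $current );
    for my $line (@lines) {
        if ( $line =~ /^([a-z].*)$/i ) {
            $current = $1;    # copy $1 right after the successful match
        }
        elsif ( defined $current ) {
            $totals{$current} += $line;    # never reads $1 after a failed match
        }
    }
    return \%totals;
}

my $totals = total_by_player(qw(player1 11 22 11 player2 10 21 player1 22));
print "$_: $totals->{$_}\n" for sort keys %$totals;
```

With the sample data from the question this should print player1: 66 and player2: 31, and it no longer depends on what $1 holds after a failed match.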

    Speculation: As for why it's doing this, I have a guess that as the regex engine goes left to right across the string, it starts matching and filling up the buffer for $1 with uppercase characters, clobbering what was in it before. It doesn't fail until it gets to a lowercase character (when the regex is expecting the end of string), but $1 is already trashed. When the non-matching strings in your __DATA__ started with lowercase letters, the regex could fail before even trying to fill the buffer for $1, so it was not clobbered and the old value remained.

    Why you still ended up getting exactly the old first character though is a mystery to me.

    blokhead

      You are trying to use $1 after a failing match, so basically you deserve whatever's coming to you ;)

      Thanks, blokhead. That's more or less what I'd intuited. But, in that case:

      1) Why doesn't the 'failing match' clobber $1 in my first two snippets?

      2) With respect to the docs:

      The numbered variables ($1, $2, $3, etc.) ... are all dynamically scoped until the end of the enclosing block or until the next successful match, whichever comes first.

      This seems to suggest that $1 (provided that it stays in scope) should not change until a match succeeds, rather than being clobbered if a match fails (which, I'm sure you'll agree, is not the same thing...).

      Why you still ended up getting exactly the old first character though is a mystery to me.

      To me too...

      Still Confused,

      dave

Re: What's happening to my $1?
by pg (Canon) on Jan 21, 2004 at 00:16 UTC

    I agree with you that this is a problem, as the behavior is not consistent.

    However, this can also be looked at from a different angle. Although it is not clearly stated, or only vaguely mentioned, it is unsafe to use $1 after a failed match, and it is up to you to exercise caution.

    I consider this some sort of grey area. It would be nice if the consistency were there, but that kind of consistency was never clearly promised.

    By the way, in your first piece of code, there is no need to initialize the hash elements to 0. This is Perl ;-)

    my %totals;
    while ( <DATA> ) {
        chomp;
        if ( /(^[a-z].*$)/i ) {
            # I deleted the initialization here, and you still get what you expected
        }
        else {
            $totals{$1} += $_;
        }
    }
    print "$_: $totals{$_}\n" for keys %totals;
    __DATA__
    player1
    11
    22
    11
    player2
    10
    21
    player1
    22
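The reason the initialization is unnecessary: `+=` on a hash element that does not exist yet treats the missing value as 0, and operators like `+=` are exempt from the 'uninitialized value' warning. A small sketch of just that point, where `tally` is a made-up helper name and the key/amount pairs are illustrative data:

```perl
use strict;
use warnings;

# Hypothetical helper: adds up amounts per key without pre-initializing.
sub tally {
    my @pairs = @_;    # flat list: key, amount, key, amount, ...
    my %sum;
    while ( my ( $key, $amount ) = splice @pairs, 0, 2 ) {
        $sum{$key} += $amount;    # a missing element starts out as 0
    }
    return \%sum;
}

my $sum = tally( player1 => 11, player2 => 10, player1 => 22 );
print "$_: $sum->{$_}\n" for sort keys %$sum;
```

This runs cleanly under strict and warnings, which is why the `= 0 unless ...` line can go.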
      By the way, in your first piece of code, there is no need to initialize the hash elements to 0. This is Perl ;-)

      Thanks, you're right, of course. However, I happen to hate empty if blocks.

      Don't worry, it's just a personal thing ;-). So I re-wrote my original code (which I'll probably never use again since it's proved so fragile):

      my %totals;
      while ( <DATA> ) {
          chomp;
          /(^[a-z].*$)/i or $totals{$1} += $_;
      }

      Thanks again for the tip.

      dave

quantifier optimization and dynamic scoping?
by sleepingsquirrel (Chaplain) on Jan 21, 2004 at 00:03 UTC
    That's definitely a strange situation you've got there. The problem has something to do with the quantifier in the regex. When I replace /^([A-Z]+)$/ with /^([A-Z]*)$/ (i.e. change the '+' to a '*') I get...
    In if: FOO
    In else:
    In else:
    In if: BAR
    In else:
    while changing the regex to /^([A-Z]{2,})$/ results in...
    In if: FOO
    In else: FOO
    In else:
    In if: BAR
    In else: BA
    While experimenting with {0,}, {1,}, {3,}, etc., it seems that in the non-matching case $1 contains characters from the previous match, and the number of them depends on the minimum count in the regex's quantifier. I wouldn't try to depend on this behaviour if I were you :-)
Re: What's happening to my $1?
by ysth (Canon) on Jan 21, 2004 at 09:26 UTC
    I had always thought (as others comment) that $1 and friends weren't reliable after a failed match, but I note that the following was added to perlre beginning with 5.8.1:
    NOTE: failed matches in Perl do not reset the match variables, which makes it easier to write code that tests for a series of more specific cases and remembers the best match.
    But perlre also says:
    The numbered match variables ($1, $2, $3, etc.) and the related punctuation set ($+, $&, $`, $', and $^N) are all dynamically scoped until the end of the enclosing block or until the next successful match, whichever comes first. (See perlsyn/"Compound Statements".)
    and in your case you are actually leaving and reentering the enclosing block, which seems to make a difference:
    $ perl -we'$_="FOO"; /^([A-Z]+)$/, print $1; $_="Foo"; /^([A-Z]+)$/; print $1'
    FOOFOO
    $ perl -we'for ("FOO","Foo") {/^([A-Z]+)$/; print $1 }'
    FOOF
    though it would be nice if it failed consistently rather than sometimes working and other times not.
Re: What's happening to my $1?
by duff (Parson) on Jan 21, 2004 at 06:33 UTC

    Others have fleshed out the whys and wherefores of the behavior you've experienced, but I just wanted to mention that perhaps some important information has left the perl documentation. For as long as I can remember (since 1992ish) I've followed the practice that the numbered vars are only good until the next match that contains parenthesised bits (regardless of that match's success or failure). I think I read in the perl man page at the time, or perhaps in the pink camel, that this was the only safe way to proceed, and it's never steered me wrong. Perhaps that has gone missing from the docs (I didn't see it just now when I looked), or maybe perl is actually supposed to not clobber $1 and friends unless the entire pattern matches. Either way, I think a doc patch is necessary.
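One way to follow that practice without ever reading $1 directly is to take the captures from the match's return value in list context. This is only a sketch; `heading_of` is a made-up name, and the sample strings are the ones from the question:

```perl
use strict;
use warnings;

# Hypothetical helper: returns the capture, or undef if the line does not match.
sub heading_of {
    my ($line) = @_;

    # In list context a successful match returns its captures;
    # a failed match returns the empty list, so $heading stays undef.
    my ($heading) = $line =~ /^([A-Z]+)$/;
    return $heading;
}

for my $line (qw(FOO 1234 Xyz BAR DFGdfg)) {
    my $heading = heading_of($line);
    print defined $heading ? "matched: $heading\n" : "no match: $line\n";
}
```

Because the variable is filled only from a successful match, whatever the engine leaves in $1 after a failure never comes into play.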

Re: What's happening to my $1?
by Anonymous Monk on Jan 21, 2004 at 05:09 UTC
    In your second set of data, the "Xyz" causes the engine to start matching with the "X", but the rest of the string causes the match to fail, and thus the last read match is $1, i.e. "FOO". However, since the matching engine started a new match, the number of matched characters was reset, so that essentially there is only one valid character reported; hence, what's available in $1 is "F". Personally, I wouldn't rely on what is not matched in $1, but would rather re-evaluate the string should the initial match fail. YMMV. TMTOWTDI.
Re: What's happening to my $1?
by ignatz (Vicar) on Jan 21, 2004 at 13:32 UTC