http://qs321.pair.com?node_id=11115834

nysus has asked for the wisdom of the Perl Monks concerning the following question:

Was showing my kid some perl and was explaining to him what $_ and regexes were. I ran into something unexpected. I had him create the following to print the first two words of each line from a file:

open (my $fh, '<text.txt');
while (<$fh>) {
    /^([^ ]+) ([^ ]+)/;
    print "$1 $2" . "\n" if $1 && $2;
}

It seems innocent enough. However, it gave me some unexpected results. If the file contains this:

hello one two three kjsf kjsd kjd
The output is:
hello one
hello one
hello one
hello one

That's because no match is found on lines 2, 3 and 4 so it repeats $1 and $2 from line 1. However, if I change the program to this:

while (my $line = <$fh>) {
    $line =~ /^([^ ]+) ([^ ]+)/;
    print "$1 $2" . "\n" if $1 && $2;
}
The output is the expected, single line:
hello one

So why does matching against $line reset $1 and $2 but matching against $_ does not?

$PM = "Perl Monk's";
$MCF = "Most Clueless Friar Abbot Bishop Pontiff Deacon Curate Priest Vicar";
$nysus = $PM . ' ' . $MCF;

Replies are listed 'Best First'.
Re: Matching against $_ behaves differently than matching against a named scalar?
by choroba (Cardinal) on Apr 20, 2020 at 16:36 UTC
    You can get the same behaviour for a named variable if you declare it outside the loop:
    my $line; while ($line = <>) { ...

    With the declaration inside the condition, it's in fact a different variable every time, so Perl needs to create an extra scope for it, as B::Deparse shows you:

    $ perl -MO=Deparse -e 'while (<>) { /(.)/ }'
    while (defined($_ = readline ARGV)) {
        /(.)/;
    }
    -e syntax OK
    $ perl -MO=Deparse -e 'while (my $line = <>) { $line =~ /(.)/ }'
    while (defined(my $line = readline ARGV)) {
        do {
            $line =~ /(.)/
        };
    }
    -e syntax OK

    Also note that lines containing 0 as their first or second word are not printed.
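
    To illustrate the point about 0, here is a minimal sketch comparing the truth test from the question with a test of the match result itself. The sub names and sample lines are made up for this illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: mimics the original test on the truth of $1 and $2.
sub by_capture_truth {
    my ($line) = @_;
    if ($line =~ /^([^ ]+) ([^ ]+)/) {
        return "$1 $2" if $1 && $2;    # "0" is false, so such a line is skipped
    }
    return;
}

# Hypothetical helper: tests whether the match succeeded instead.
sub by_match_result {
    my ($line) = @_;
    return "$1 $2" if $line =~ /^([^ ]+) ([^ ]+)/;
    return;
}

for my $line ('0 apples left', 'take 0 now', 'hello one two') {
    my $truthy  = by_capture_truth($line);
    my $matched = by_match_result($line);
    printf "%-15s truth test: %-10s match test: %s\n",
        "'$line'", defined $truthy ? $truthy : '(skipped)', $matched;
}
```

    The first two sample lines match the regex, but only the version that tests the match result reports them.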

    map{substr$_->[0],$_->[1]||0,1}[\*||{},3],[[]],[ref qr-1,-,-1],[{}],[sub{}^*ARGV,3]
Re: Matching against $_ behaves differently than matching against a named scalar?
by stevieb (Canon) on Apr 20, 2020 at 16:44 UTC

    choroba answered the why, I'll give a different way to do things.

    Instead of checking the condition of the variables after the fact, do it beforehand:

    use warnings;
    use strict;

    open my $fh, '<', 'text.txt' or die $!;

    while (<$fh>) {
        if (/^([^ ]+) ([^ ]+)/) {
            # Skip this if $1 and $2 weren't populated
            print "$1 $2\n";
        }
    }

    Also note the die() statement if the file can't be opened, and the use of 3-arg open().

    One last thing... in your former example, you're missing the + in the regex.

Re: Matching against $_ behaves differently than matching against a named scalar?
by jcb (Parson) on Apr 21, 2020 at 02:18 UTC

    Other monks have come close but have not said this specifically; choroba illustrated but did not explain why using a lexical instead of $_ produces a different result.

    In Perl, the regex capture variables ($1, $2, etc.) are implicitly local to every containing block, but retain their values within a block until the next successful match replaces them. Introducing a lexical in the loop header implicitly introduces another block scope, so the capture variables are effectively reset on every loop iteration (strictly, each loop iteration has its own set of regex capture variables). Your second example is also subtly different because you forgot to test defined(my $line = <$fh>), so a line that evaluates to a false value will cause that loop to terminate early.
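
    A small sketch of the two rules described here: a failed match leaves the earlier captures in place, and captures set inside a block are gone once the block exits. The sub names and strings are invented for the illustration.

```perl
use strict;
use warnings;

# A failed match leaves the previous captures untouched.
sub after_failed_match {
    'hello one' =~ /^(\S+) (\S+)/;    # succeeds and sets $1 and $2
    'two'       =~ /^(\S+) (\S+)/;    # fails; $1 and $2 are not reset
    return "$1 $2";
}

# Captures set inside a block are restored to the outer values on exit.
sub inner_then_outer {
    my @seen;
    'abc' =~ /(b)/;
    {
        'xyz' =~ /(y)/;
        push @seen, $1;    # the inner block sees its own match
    }
    push @seen, $1;        # the outer capture is visible again
    return @seen;
}

print after_failed_match(), "\n";
print join(',', inner_then_outer()), "\n";
```

    The first sub mirrors what happens in the while (<$fh>) loop from the question; the second mirrors the extra scope that the lexical in the loop header brings with it.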

    The regex match itself returns a boolean value indicating success in Perl, and standard practice is to test that return value to determine if the regex matched, rather than relying on the truth of the capture variables.

    Here's a slightly different example to illustrate:

    open (my $fh, '<text.txt');
    while (<$fh>) {
        print "$1 $2" . "\n" if /^([^ ]+) ([^ ]+)/;
    }

    The exact rules for the regex capture variables are prickly, with lots of sharp edges, so good practice is to consider the regex capture variables only valid after a successful match until the next match is attempted and to have unspecified values at all other times.

    Edited by jcb: As davido pointed out, the defined test is implicit when an I/O operator is used in a loop test.

      I want to clarify something based on documentation from perlop:

      while (my $line = <STDIN>) { print $line }

      In these loop constructs, the assigned value (whether assignment is automatic or explicit) is then tested to see whether it is defined. The defined test avoids problems where the line has a string value that would be treated as false by Perl; for example a "" or a "0" with no trailing newline.

      So in this case the defined test doesn't need to be done explicitly, it's already being done implicitly.
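
      A hedged sketch of the difference, using an in-memory filehandle whose last line is "0" with no trailing newline. The sub names and the sample data are assumptions made for the example.

```perl
use strict;
use warnings;

# Sample input held in memory; the final line is "0" with no trailing newline.
my $data = "first\nsecond\n0";

# while (my $line = <$fh>) gets the implicit defined() test from perlop.
sub count_with_while {
    open my $fh, '<', \$data or die "cannot open in-memory file: $!";
    my $count = 0;
    while (my $line = <$fh>) {
        $count++;
    }
    return $count;
}

# A plain truth test on the value read stops early at the false "0".
sub count_with_truth_test {
    open my $fh, '<', \$data or die "cannot open in-memory file: $!";
    my $count = 0;
    for (;;) {
        my $line = <$fh>;
        last unless $line;
        $count++;
    }
    return $count;
}

print count_with_while(), " vs ", count_with_truth_test(), "\n";
```

      The while form reads all three lines; the hand-rolled truth test stops before counting the final "0".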


      Dave

        You are correct. I had forgotten about that particular bit of DWIM, since I do not rely on it in my own code.

Re: Matching against $_ behaves differently than matching against a named scalar?
by rjt (Curate) on Apr 20, 2020 at 20:28 UTC

    You have a couple of suggestions already, plus the explanation for why $1 and $2 survive successive loop iterations. Here is how I would modify the code:

    use 5.010;
    use autodie;

    open my $fh, '<', 'text.txt';
    while (<$fh>) {
        my @words = split /\s/;
        say "@words[0..1]" if @words >= 2;
    }

    Note the use of autodie to avoid having to do explicit error checking on open or reads. Also note the use of three-argument open, which is an important security best-practice. Not necessary in your example, since your filename is a literal, but it's a good habit to get into.

    It looks like you're really just splitting words on whitespace, so split seemed more natural and expressive to me. Maybe this was a contrived example to show the regex behaviour, and your real code really needs the regex, but for what's in front of me, split would be my choice.
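
    One thing to watch with split: /\s/ splits on every single whitespace character, so doubled spaces produce empty fields, while the special pattern ' ' skips leading whitespace and treats any run of whitespace as one separator. A minimal sketch of the difference follows; the helper names are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: first two fields using split /\s/, as in the code above.
sub first_two_single_ws {
    my ($line) = @_;
    my @words = split /\s/, $line;
    return @words >= 2 ? "@words[0..1]" : undef;
}

# Hypothetical helper: the special ' ' pattern skips leading whitespace
# and treats any run of whitespace as one separator.
sub first_two_awk_style {
    my ($line) = @_;
    my @words = split ' ', $line;
    return @words >= 2 ? "@words[0..1]" : undef;
}

for my $line ("hello one\n", "hello  one\n") {
    printf "[%s] [%s]\n", first_two_single_ws($line), first_two_awk_style($line);
}
```

    With a doubled space, the /\s/ version returns "hello" followed by an empty second field, whereas the ' ' version still returns the two words.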

    Finally, I like say, but you can of course use print, and drop the 5.010 requirement if you are going for maximum backward compatibility.

    Edit: My actual final word is, thanks for teaching your kid some Perl!

    use strict; use warnings; omitted for brevity.
Re: Matching against $_ behaves differently than matching against a named scalar?
by BillKSmith (Monsignor) on Apr 20, 2020 at 17:15 UTC
    I have duplicated your result (Strawberry Perl 5.24.1 on Windows 7). Very strange indeed! I am not clear which behaviour we should 'expect'. Is this an example of the 'nested block' referred to in the documentation of $<digits> in perlvar?
    Bill
Re: Matching against $_ behaves differently than matching against a named scalar?
by Marshall (Canon) on Apr 23, 2020 at 01:28 UTC
    As a general practice, I do not fiddle around with $1 and $2. I use list context to assign these variables to specific names. This avoids some complications and is not "expensive" in terms of CPU.

    use strict;
    use warnings;

    while (<DATA>) {
        if ( (my $first,my $second) = /^([^ ]+) ([^ ]+)/ ) {
            print "$first $second\n";
        }
    }
    #prints: hello one
    __DATA__
    hello one two three kjsf kjsd kjd

    Now of course the regex could be written differently. This means the same thing.

    use strict;
    use warnings;

    while (<DATA>) {
        if ( (my $first,my $second) = /^(\S+)\s+(\S+)/ )   #ok, allow extra spaces between tokens
        {
            print "$first $second\n";
        }
    }
    #prints: hello one
    __DATA__
    hello one two three kjsf kjsd kjd

      (my $first,my $second) = /^([^ ]+) ([^ ]+)/

      Eeew :P

      if( my( $first, $last) = $line =~ /^([^ ]+) ([^ ]+)/ ){ }
Re: Matching against $_ behaves differently than matching against a named scalar?
by rsFalse (Chaplain) on Apr 23, 2020 at 21:53 UTC
    A possible way to overcome it without using a lexical variable inside the loop is to force a successful match, which resets $1, $2, ...

    /^([^ ]+) ([^ ]+)/ or /(*ACCEPT)/; # or /(?=)/;
    print "$1 $2" . "\n" if $1 && $2;

    Also, it is possible to 'control flow', being inside regex:
    /^([^ ]+) ([^ ]+)(?{ print "$1 $2" . "\n" })/;
    That way of using regex (with (?{ <code> }) construct) is useful for debugging.
Re: Matching against $_ behaves differently than matching against a named scalar? (use re 'debug';)
by Anonymous Monk on Apr 21, 2020 at 12:14 UTC
    use re 'debug';