http://qs321.pair.com?node_id=1109947

MidLifeXis has asked for the wisdom of the Perl Monks concerning the following question:

I asked on CB today whether the benefits of using named capture groups outweigh the overhead, especially from a maintenance point of view. Given the varied answers and opinions, as well as encouragement to post for a wider range of comments, I am posting the question here. Specifically, I am breaking an almost free-form logfile format into tokens (FlexLM, for those who are interested). Each line in the log file may be in one of many different formats.

The two general forms I am looking at are...

my $re_foo = qr{
    ...
    (?<type> ... )
    ...
    (?<name1> ... )
}x;
my $re_bar = qr{
    ...
    (?<name2> ... )
    ...
    (?<type> ... )
    ...
}x;
my $re_all = qr{$re_foo|$re_bar};

if ( $data =~ $re_all ) {
    return { %+ };
}
...

vs

my $re_foo = qr{
    ...
    (   # type
        ...
    )
    ...
    (   # name1
        ...
    )
}x;
my $re_bar = qr{
    ...
    (   # name2
        ...
    )
    ...
    (   # type
        ...
    )
    ...
}x;

if ( $data =~ $re_foo ) {
    return { type => $1, name1 => $2 };
}
elsif ( $data =~ $re_bar ) {
    return { type => $2, name2 => $1 };
}
...

The first seems to me to be much more maintainable, even if performance is impacted a bit. What other opinions, comments, or concerns are there about this construct?
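
For anyone who wants to experiment with the first form, here is a minimal runnable sketch. The two line formats and the parse_line name are invented for illustration and are not real FlexLM syntax; the only point is that a single alternation plus a copy of %+ yields whichever named groups actually matched.

```perl
use strict;
use warnings;

# Hypothetical line formats, made up for this sketch (not real FlexLM lines).
my $re_foo = qr{ ^ (?<type> OUT|IN ) : \s+ (?<name1> \S+ ) \z }x;
my $re_bar = qr{ ^ (?<name2> \S+ ) \s+ did \s+ (?<type> \w+ ) \z }x;
my $re_all = qr{$re_foo|$re_bar};

# Hypothetical helper: returns a hashref of the named captures, or nothing.
sub parse_line {
    my ($data) = @_;
    return unless $data =~ $re_all;
    my %captures = %+;    # copy only the named groups that took part in the match
    return \%captures;
}

for my $line ('OUT: matlab', 'alice did checkout') {
    my $fields = parse_line($line) or next;
    print join(', ', map { "$_=$fields->{$_}" } sort keys %$fields), "\n";
}
```

Both formats use the name type, so the caller reads $fields->{type} without caring which branch matched or where the group sits in it.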

--MidLifeXis

Re: Named captures or positional variables
by tobyink (Canon) on Dec 11, 2014 at 11:21 UTC

    It depends a lot on your situation.

    If you need to support Perl older than 5.10, then named captures are out. Full stop.

    If you're in a tight loop, using positional variables, or assigning the result of the regexp match to a list of variables, will be faster than named captures, which use a tied hash.
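
    A rough way to check the speed claim on your own data is the core Benchmark module. The sample line and the three sub names below are assumptions made up for this sketch; the relative numbers will depend on the Perl version and the patterns involved, so treat the output as indicative only.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

my $line = 'acct-1001 250.00';    # made-up sample input

# Named captures, read back through the %+ hash.
sub via_named {
    return unless $line =~ /^(?<account>\S+)\s+(?<amount>\S+)/;
    return ($+{account}, $+{amount});
}

# Positional variables.
sub via_positional {
    return unless $line =~ /^(\S+)\s+(\S+)/;
    return ($1, $2);
}

# Match in list context, assigned straight to lexicals.
sub via_list {
    my ($account, $amount) = $line =~ /^(\S+)\s+(\S+)/;
    return unless defined $account;
    return ($account, $amount);
}

# Run each variant 100_000 times and print a comparison table.
cmpthese(100_000, {
    named      => \&via_named,
    positional => \&via_positional,
    list       => \&via_list,
});
```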

    In the case where you're passing around regexps as part of an API, then named captures seem the best idea. For example, you have a function which accepts a filehandle and a regexp, and does some processing on the file, using the regexp to extract the right data.

    sub process_file {
        my ($fh, $re) = @_;
        while (<$fh>) {
            next unless /$re/;
            # do stuff with captured data
        }
    }

    Here named captures make a lot more sense because they give the caller a lot more flexibility. If you have something like:

    my $account_number = $1;
    my $deposit_amount = $2;

    ... then it ties the input format so that the deposit amount can never be in the first column, before the account number. Named captures don't suffer from that.
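
    To make the flexibility argument concrete, here is a sketch of such an API. The capture names account and amount, the comma-separated layouts, and the in-memory sample data are all assumptions for illustration; the same process_file handles both column orders because it only looks at %+.

```perl
use strict;
use warnings;

# Collect account/amount pairs from a filehandle; the caller's regex decides the layout.
sub process_file {
    my ($fh, $re) = @_;
    my @records;
    while (my $line = <$fh>) {
        next unless $line =~ $re;
        push @records, { account => $+{account}, amount => $+{amount} };
    }
    return \@records;
}

# Two hypothetical layouts with the columns in opposite order.
my $account_first = qr/^(?<account>\w+),(?<amount>[\d.]+)$/;
my $amount_first  = qr/^(?<amount>[\d.]+),(?<account>\w+)$/;

# In-memory filehandles stand in for real files.
my $text_a = "A100,25.00\nB200,7.50\n";
my $text_b = "25.00,A100\n7.50,B200\n";
open my $fh_a, '<', \$text_a or die "Cannot open in-memory file: $!";
open my $fh_b, '<', \$text_b or die "Cannot open in-memory file: $!";

my $from_a = process_file($fh_a, $account_first);
my $from_b = process_file($fh_b, $amount_first);

print "$_->{account} deposited $_->{amount}\n" for @$from_a, @$from_b;
```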

Re: Named captures or positional variables
by LanX (Saint) on Dec 10, 2014 at 21:04 UTC
    Maybe off topic, but I'm meditating on the most elegant way to put a regex into a $var such that it's automatically expanded to "(?<var>$var)" when constructing bigger REs.

    Maybe with a tied hash %N redefining fetch...

    This would not only be DRY but also allow full control over all capture names...
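
    One possible reading of the tied-hash idea, as a sketch only: the class name NamedFragments and the sample fragments are hypothetical, and this is just one way FETCH could be redefined to wrap a stored pattern in a named group.

```perl
use strict;
use warnings;

# Hypothetical helper class: fetching $N{name} yields "(?<name>pattern)".
package NamedFragments;

sub TIEHASH { my ($class) = @_; return bless {}, $class }

sub STORE {
    my ($self, $name, $pattern) = @_;
    $self->{$name} = $pattern;
}

sub FETCH {
    my ($self, $name) = @_;
    die "No fragment named '$name'" unless exists $self->{$name};
    return "(?<$name>$self->{$name})";    # wrap the fragment in a named group
}

package main;

tie my %N, 'NamedFragments';
$N{user} = '\w+';
$N{host} = '[\w.]+';

# Each fragment is defined once; the capture name comes from the hash key.
my $re = qr/^$N{user}\@$N{host}\z/;

if ('rolf@perlmonks.org' =~ $re) {
    print "user=$+{user} host=$+{host}\n";
}
```

    Because the name is taken from the key, the fragment and its capture name cannot drift apart, which is the DRY part of the idea.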

    Cheers Rolf

    (addicted to the Perl Programming Language and ☆☆☆☆ :)

      Maybe something like use revar name => "regex";

      or use revar '$name' => "regex";

      (Which are similar to use constant or use vars)

      Note: I suspect this would create "our" variables. Is it possible for a pragma to create "my" variables in the use-ing package?

        > Is it possible for a pragma to create "my" variables in the use-ing package?

        If lexical variables are already declared in the calling package, you can change their value, e.g. with PadWalker.

        Otherwise it wouldn't compile under strict if you try to use them.

        Cheers Rolf

        (addicted to the Perl Programming Language and ☆☆☆☆ :)

Re: Named captures or positional variables
by Anonymous Monk on Dec 10, 2014 at 20:34 UTC

    Nowadays I'd only use $1 etc. if they are used within something like 5-10 lines of the regex. Even then I often find myself writing my ($foo,$bar) = $str =~ /^(\w+)\s+(.+)/; instead of using $1 and friends. I think the only time I've used $1 recently is in a construct like if (/^\w+\s+(.+)/) { my $foo = $1; ....

    In your example case, I would almost certainly use named capture groups, since it looks like the regexes are defined nowhere near the $1 variables; I'd say there's too much of a risk of someone accidentally adding a capture group somewhere and throwing off all of the indices of the other groups. Named capture groups can also help someone trying to understand a regex later on.

    So in general, I find having things named, especially stored in lexical variables (i.e. watched by strict), is always preferable from a maintenance standpoint. I usually only back down from that when the scripts are really compact and/or it's a throwaway script. As for performance, you know what they say about optimization ;-)
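
    The index-shifting risk is easy to demonstrate. In this sketch the input string and the capture names first and rest are made up; the second match simulates a later edit that adds a capture group around the whitespace.

```perl
use strict;
use warnings;

my $str = 'alice  likes named captures';    # made-up input

# List assignment: the captures land directly in lexicals watched by strict.
my ($foo, $bar) = $str =~ /^(\w+)\s+(.+)/;

# A later edit wraps the whitespace in a new group, and $2 silently changes meaning.
my $shifted;
if ($str =~ /^(\w+)(\s+)(.+)/) {
    $shifted = $2;    # now the whitespace, no longer the rest of the line
}

# The same edit leaves a named capture untouched.
my $named;
if ($str =~ /^(?<first>\w+)(\s+)(?<rest>.+)/) {
    $named = $+{rest};
}

print "list:    [$bar]\n";
print "shifted: [$shifted]\n";
print "named:   [$named]\n";
```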

      Funny you should mention that (performance). I just ran benchmarks on the two parsers against a test data set. The only difference is the module used to parse the lines of text from the log files; the parsers are set up outside of the benchmark, with a single initial run to prime the I/O buffers, etc. The results were within 2-10% either way, depending on the data set - $1 and company in the lead, but negligible in the big picture. Since the log files are stored on disk, I/O is the limiting factor at the moment. The parsing uses less than 15% of a single processor.

      --MidLifeXis

Re: Named captures or positional variables
by RonW (Parson) on Dec 10, 2014 at 20:54 UTC

    I think it depends on the expression the captures are embedded in. Without seeing the whole expressions, your example appears to benefit from named captures because of the varying order in which equivalent "fields" occur. However, readability could be an issue, even with the /x option - and even when splitting expressions into separately defined sub-expressions.