Understanding this regex

BradV has asked for the wisdom of the Perl Monks concerning the following question:

I recently needed to write some code to pull a single (and first) dn out of each line in an LDAP file. Each dn is contained between '<>' and always begins with capital CN=. I used:

#!/usr/bin/perl -w
my @dn;
open FILE, "crap" or die $!;
while (<FILE>) {
   chomp $_;
   @dn = $_ =~ /(<CN=.*?>)/;
   print "$dn[0]\n";
}
close (FILE);

This works great. The only part I'm not sure about is the parentheses. If I remove them, then $dn[0] just holds the value 1, which says that yes, the regex matched. By putting the parentheses in, I instead get the actual match. Could someone give me an explanation for that please?

Thanks!


Replies are listed 'Best First'.
Re: Understanding this regex by rjt (Curate) on Jun 04, 2013 at 11:58 UTC
The parentheses serve two purposes: they group a sub-expression`[1]`, and they also create "capture groups", whose matches are placed in the numbered variables `$1, $2, ...`, or in the `%+` hash if the new(ish) named capture groups feature is employed. They also have the effect that when a regex is evaluated in list context, the capture groups are returned as a list. Try this:

    my $date = '2013-06-04 01:23:00'; # June 4th
    $date =~ /^((\d{4})-(\d{2})-(\d{2})) ((..):(..):(..))$/;

The capture groups are numbered according to the order in which their opening parens appear. Hence, the following would be true:

    my $ymd  = $1; # 2013-06-04
    my $yyyy = $2; # 2013
    my $mm   = $3; # 06
    my $dd   = $4; # 04
    my $time = $5; # 01:23:00
    my $hh   = $6; # 01
    my $min  = $7; # 23
    my $ss   = $8; # 00

Similarly, in list context:

    use Data::Dump;
    my @a = $date =~ /^((\d{4})-(\d{2})-(\d{2})) ((..):(..):(..))$/;
    dd @a;
    __END__
    ("2013-06-04", 2013, "06", "04", "01:23:00", "01", 23, "00")

Note that the array `@a` now contains the same values as `$1..$8` at array positions `0..7`.

Hope this helps. As always, the Perl documentation is an excellent source of more detailed information: perlre and perlretut are good starting points.

`[1]` - If you are using parens only for a sub-expression and do not need that expression captured into a `$n` capture variable, use `(?:...)`, as in `$color =~ /^(?:black|white|red|green|blue)$/`. This prevents the creation of a capture group, so the color would not be put into `$1`. This example is obviously contrived, but judicious use of `(?:...)` can result in performance improvements as well as increased clarity in your code.
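For the `%+` hash mentioned above, here is a minimal sketch using named capture groups (available since Perl 5.10). The group names and the `%part` hash are my own choices for illustration, not anything from the original post:

```perl
use strict;
use warnings;

my $stamp = '2013-06-04 01:23:00';
my %part;    # hypothetical name: a plain copy of the named captures

if ($stamp =~ /^(?<year>\d{4})-(?<month>\d{2})-(?<day>\d{2}) (?<hour>\d{2}):(?<min>\d{2}):(?<sec>\d{2})$/) {
    %part = %+;    # copy now: %+ changes on the next successful match
    print "date: $part{year}/$part{month}/$part{day}\n";    # date: 2013/06/04
    print "time: $part{hour}h $part{min}m $part{sec}s\n";   # time: 01h 23m 00s
}
```

Named groups are still numbered as well, so `$1` would hold the year here.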
Re: Understanding this regex by choroba (Cardinal) on Jun 04, 2013 at 11:48 UTC
See Regexp Quote Like Operators:

"If the `/g` option is not used, `m//` in list context returns a list consisting of the subexpressions matched by the parentheses in the pattern, that is, ($1 , $2 , $3 ...). (...) When there are no parentheses in the pattern, the return value is the list `(1)` for success. With or without parentheses, an empty list is returned upon failure."

لսႽ† ᥲᥒ⚪⟊Ⴙᘓᖇ Ꮅᘓᖇ⎱ Ⴙᥲ𝇋ƙᘓᖇ
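The three cases in that quote can be seen side by side in a small sketch. The sample line and the variable names are made up for the demonstration:

```perl
use strict;
use warnings;

# Made-up sample line; real LDAP data will look different.
my $line = 'x <CN=alice,OU=people> y <CN=bob,OU=people>';

my @with_parens    = $line =~ /(<CN=.*?>)/;   # list context, one capture group
my @without_parens = $line =~ /<CN=.*?>/;     # list context, no capture group
my @no_match       = $line =~ /(<DN=.*?>)/;   # pattern does not match at all

print "with parens:    @with_parens\n";                      # <CN=alice,OU=people>
print "without parens: @without_parens\n";                   # 1
print "failed match:   ", scalar(@no_match), " elements\n";  # 0 elements
```

This is why `$dn[0]` in the original question turned into 1 once the parentheses were removed.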
Re^2: Understanding this regex by BradV (Sexton) on Jun 06, 2013 at 13:37 UTC
Thanks to all for the explanations and suggested enhancements! :) That helps a lot!
Re: Understanding this regex by hbm (Hermit) on Jun 04, 2013 at 13:38 UTC
For "a single (and first) dn", use a scalar (not array) and exit the loop once you get it:

    my $dn;
    while (<FILE>) {
        next unless /(<CN=.*?>)/;
        $dn = $1;
        last;
    }
Re: Understanding this regex by Anonymous Monk on Jun 05, 2013 at 04:42 UTC
Simpler:

    sub get_dn {
        local#($filename);
        @ARGV = @_;
        map# <>,
        #gx,
        m# (<CN=.*?>) #gx, <>,
    }
    my($dn1) = get_dn('crap');
Re^2: Understanding this regex by Anonymous Monk on Jun 05, 2013 at 05:00 UTC
"Simpler:" Sure, but also riskier (more dangerous) :) You also didn't local-ize $^I. Also, '#' is the worst choice for a m//atch or s///ubstitution delimiter: it means you can't use # to comment your regular expression. Never use '#', '$', '@' and '\\' as delimiters; they're the worst possible choices.
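To illustrate the delimiter point, here is a rough sketch that uses `m{...}x` so that `#` is free for comments inside the pattern. The `extract_dns` helper is hypothetical, and it takes lines as arguments instead of touching `@ARGV`, which is my own simplification:

```perl
use strict;
use warnings;

# Hypothetical helper: returns every <CN=...> found in the given lines.
sub extract_dns {
    my @found;
    for my $line (@_) {
        push @found, $line =~ m{
            ( < CN= .*? > )   # capture from "<CN=" up to the nearest ">"
        }gx;
    }
    return @found;
}

my @dns = extract_dns('a <CN=alice> b <CN=bob>', 'no dn here', '<CN=carol>');
print "$_\n" for @dns;
```

With `/x`, the whitespace inside the pattern is ignored, and with `/g` in list context every capture on a line is returned, not only the first one.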
Re: Understanding this regex (perlintro) by Anonymous Monk on Jun 04, 2013 at 23:01 UTC
"Could someone give me an explanation for that please?" perlintro can, as can perlrequick.

