http://qs321.pair.com?node_id=1060041

in reply to Re^2: Explain a regexp matched group result
in thread Explain a regexp matched group result

From perlvar:

$-[0] is the offset of the start of the last successful match.

You have my $re1 = qr/((a+)?(b+)?(c))*/;

Your outer capture group may be repeated zero or more times. In the case of your test string, "aacbbbcac", it matches three times:

At the first repeat, it matches "aac" with $1 being "aac", $2 being "aa", $3 being undef (not matched) and $4 being "c", but because of the repeat count it doesn't stop there, so you never see these values.

At the second repeat, $1 is "bbbc", $2 remains "aa" (group (a+) didn't match in this repeat, but $2 is the 'last successful match'), $3 is "bbb" and $4 is "c", but it doesn't stop there, so you don't see these values either.

The third and final repeat sets $1 to "ac" and $2 to "a", leaves $3 as it was at the last successful match (i.e. "bbb"), and sets $4 to "c".

So the issue is that each capture group reports its last successful match, which may come from an earlier repeat, rather than what it did (or failed to do) in the final repeat.
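A minimal sketch to check this for yourself, assuming Perl 5.10 or later (for the // operator). The sub name captures_after_match is made up for illustration; it just reads @- and @+ after one match of your original pattern:

```perl
use strict;
use warnings;

# Hypothetical helper: match the repeated pattern once and return
# whatever is left in $1..$n afterwards (undef for unset groups).
sub captures_after_match {
    my ($string) = @_;
    my $re = qr/((a+)?(b+)?(c))*/;
    return unless $string =~ $re;
    # $#+ is the number of capture groups in the last successful match
    return map {
        defined $-[$_] ? substr($string, $-[$_], $+[$_] - $-[$_]) : undef
    } 1 .. $#+;
}

my @caps = captures_after_match("aacbbbcac");
print join(", ", map { $_ // 'undef' } @caps), "\n";
# prints: ac, a, bbb, c
```

Note that the third value is "bbb", left over from the second repeat, even though (b+) took no part in the final repeat.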

Re^4: Explain a regexp matched group result
by jdd (Acolyte) on Oct 28, 2013 at 19:24 UTC
Thank you very much. So I cannot get the individual values without removing the repetition and using a while loop instead, i.e.:
    use strict;
    use warnings FATAL => 'all';

    my $string = "aacbbbcac";
    my $re1 = qr/((a+)?(b+)?(c))/;
    while ($string =~ /$re1/g) {
        foreach (0..$#-) {
            printf "Group %d: <%s>\n", $_,
                defined($-[$_]) ? substr($string, $-[$_], $+[$_] - $-[$_]) : '';
        }
        print "\n";
    }
    Group 0: <aac>
    Group 1: <aac>
    Group 2: <aa>
    Group 3: <>
    Group 4: <c>

    Group 0: <bbbc>
    Group 1: <bbbc>
    Group 2: <>
    Group 3: <bbb>
    Group 4: <c>

    Group 0: <ac>
    Group 1: <ac>
    Group 2: <a>
    Group 3: <>
    Group 4: <c>
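If the per-repeat values are wanted as data rather than printed, the same loop can collect them. This is only a sketch: per_repeat_captures is a hypothetical name, it expects the pattern without the trailing *, and it assumes Perl 5.10 or later for the // operator used when printing.

```perl
use strict;
use warnings;

# Hypothetical helper: run the unrepeated pattern with /g and return
# one array reference per repeat, holding $1..$n for that repeat
# (undef where a group took no part in it).
sub per_repeat_captures {
    my ($string, $re) = @_;
    my @rows;
    while ($string =~ /$re/g) {
        push @rows, [
            map {
                defined $-[$_] ? substr($string, $-[$_], $+[$_] - $-[$_]) : undef
            } 1 .. $#+
        ];
    }
    return @rows;
}

for my $row (per_repeat_captures("aacbbbcac", qr/((a+)?(b+)?(c))/)) {
    print join(", ", map { $_ // 'undef' } @$row), "\n";
}
# prints:
# aac, aa, undef, c
# bbbc, undef, bbb, c
# ac, a, undef, c
```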

I think your conclusion is correct in general, but if you know the structure of the RE then there are workarounds to the way the capture groups work. Consider:

    #!C:/strawberry/perl/bin/perl.exe
    #
    use strict;
    use warnings;

    my $string = "aacbbbcac";
    my $re1 = qr/((a+)?(b+)?(c))*/;
    #my $re1 = qr/((a*)(b*)(c))*/;
    #my $re1 = qr/((a+)?(b*)(c))*/;
    if ($string =~ $re1) {
        my $start = 0;
        my @something;
        foreach (0..$#-) {
            if(defined($-[$_])) {
                $start = $-[$_] if($-[$_] > $start);
                if($-[$_] >= $start) {
                    printf "Group %d: <%s>\n", $_, substr($string, $-[$_], $+[$_] - $-[$_]);
                    $something[$_] = substr($string, $-[$_], $+[$_] - $-[$_]);
                }
                else {
                    printf "Group %d: <%s> - but ignore it because it is from a previous iteration of the outer capture group\n", $_, substr($string, $-[$_], $+[$_] - $-[$_]);
                    $something[$_] = '';
                }
            }
            else {
                printf "Group %d: hasn't matched yet\n", $_;
                $something[$_] = '';
            }
        }
        print "$1 = " . join('', @something[2..4]) . "\n";
    }

Which produces

    Group 0: <aacbbbcac>
    Group 1: <ac>
    Group 2: <a>
    Group 3: <bbb> - but ignore it because it is from a previous iteration of the outer capture group
    Group 4: <c>
    ac = ac

If you are trying to write something that handles arbitrary REs, this approach is unlikely to work.

Formidable!

How to put it:

• yes: I want to handle arbitrary REs
• but only REs that have passed my grammar, so I know the overall structure in advance.
Complicated, but perfectly doable.

The origin of my question is a comparison of Perl's regexps with ECMAScript regexps, namely Note 3 of section 15.10.2.5 of the ECMA-262 spec (a perfectly sensible question, since ECMAScript regexps are a copy of Perl 5's).

I have built the AST for any ECMAScript source (cf. MarpaX::Languages::ECMA::AST), so now I am wondering about the actions associated with the grammar.
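As I read Note 3, ECMAScript clears the captures inside a repeated atom each time the atom is repeated, so a group that takes no part in the final repeat ends up undefined instead of keeping an old value. Below is a rough sketch of emulating that on top of Perl's @- and @+ when the group structure is known in advance. The sub name ecma_style_captures and its arguments are hypothetical, and it assumes the inner groups never match the empty string (otherwise a stale capture could start exactly where the last repeat starts and be mistaken for a fresh one).

```perl
use strict;
use warnings;

# Hypothetical helper: $outer is the index of the repeated group,
# @inner the indices of the groups nested inside it. Returns $1..$n
# with stale inner captures replaced by undef.
sub ecma_style_captures {
    my ($string, $re, $outer, @inner) = @_;
    return unless $string =~ $re;
    my @caps = map {
        defined $-[$_] ? substr($string, $-[$_], $+[$_] - $-[$_]) : undef
    } 0 .. $#+;
    if (defined $-[$outer]) {
        for my $i (@inner) {
            # A capture that starts before the outer group's final
            # repeat is left over from an earlier repeat.
            $caps[$i] = undef if defined $-[$i] && $-[$i] < $-[$outer];
        }
    }
    return @caps[1 .. $#caps];
}

my @caps = ecma_style_captures("aacbbbcac", qr/((a+)?(b+)?(c))*/, 1, 2, 3, 4);
print join(", ", map { $_ // 'undef' } @caps), "\n";
# prints: ac, a, undef, c
```

The // operator needs Perl 5.10 or later. For REs with several levels of nested repetition, the same check would have to be applied per repeated group, which is where knowing the structure from the grammar comes in.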