Re^4: Explain a regexp matched group result

Re^5: Explain a regexp matched group result by ig (Vicar) on Oct 28, 2013 at 19:31 UTC
I think your conclusion is correct in general, but if you know the structure of the RE then there are workarounds for the way the capture groups work. Consider:

    #!C:/strawberry/perl/bin/perl.exe
    #
    use strict;
    use warnings;

    my $string = "aacbbbcac";
    my $re1 = qr/((a+)?(b+)?(c))*/;
    #my $re1 = qr/((a*)(b*)(c))*/;
    #my $re1 = qr/((a+)?(b*)(c))*/;
    if ($string =~ $re1) {
        my $start = 0;
        my @something;
        foreach (0..$#-) {
            if(defined($-[$_])) {
                # track the rightmost start offset seen so far; a group that
                # starts before it was set in an earlier iteration
                $start = $-[$_] if($-[$_] > $start);
                if($-[$_] >= $start) {
                    printf "Group %d: <%s>\n", $_, substr($string, $-[$_], $+[$_] - $-[$_]);
                    $something[$_] = substr($string, $-[$_], $+[$_] - $-[$_]);
                } else {
                    printf "Group %d: <%s> - but ignore it because it is from a previous iteration of the outer capture group\n", $_, substr($string, $-[$_], $+[$_] - $-[$_]);
                    $something[$_] = '';
                }
            } else {
                printf "Group %d: hasn't matched yet\n", $_;
                $something[$_] = '';
            }
        }
        print "$1 = " . join('', @something[2..4]) . "\n";
    }

Which produces:

    Group 0: <aacbbbcac>
    Group 1: <ac>
    Group 2: <a>
    Group 3: <bbb> - but ignore it because it is from a previous iteration of the outer capture group
    Group 4: <c>
    ac = ac

If you are trying to write something that handles arbitrary REs, this approach is unlikely to work.
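The offset comparison above can be packaged as a small helper. This is only a sketch: `captures_of_last_iteration` is a hypothetical name, and it assumes that group 1 is the repeated outer group and that every other group is nested inside it, so it is no more general than the script it is derived from.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the original post). Returns the captures of
# a match, with any inner group that was last set before the final iteration
# of group 1 replaced by undef.
# Assumption: groups 2..N are all nested inside the repeated group 1.
sub captures_of_last_iteration {
    my ($string, $re) = @_;
    return unless $string =~ $re;

    # Copy the offset arrays right away; the next match overwrites @- and @+.
    my @start = @-;
    my @end   = @+;
    my $outer = $start[1];    # where the last iteration of group 1 began

    my @captures;
    for my $n (0 .. $#end) {
        if (!defined $start[$n]) {
            $captures[$n] = undef;    # group never took part in the match
        }
        elsif ($n > 1 && defined $outer && $start[$n] < $outer) {
            $captures[$n] = undef;    # left over from an earlier iteration
        }
        else {
            $captures[$n] = substr($string, $start[$n], $end[$n] - $start[$n]);
        }
    }
    return @captures;
}

my @c = captures_of_last_iteration("aacbbbcac", qr/((a+)?(b+)?(c))*/);
print join(', ', map { defined $_ ? "<$_>" : 'undef' } @c), "\n";
```

With the string and RE from the post this should print `<aacbbbcac>, <ac>, <a>, undef, <c>`: group 3 starts at offset 3, before the last iteration of group 1 at offset 6, so it is dropped.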
Re^6: Explain a regexp matched group result by jdd (Acolyte) on Oct 28, 2013 at 19:45 UTC
Formidable! How to put it... yes: I want to handle arbitrary REs, but only REs that have passed my grammar, so I know the overall structure in advance. Complicated, but perfectly doable. The origin of my question is the comparison of Perl's regexps with ECMAScript regexps, namely Note 3 of section 15.10.2.5 of the ECMA-262 spec (a perfectly sensible question, since ECMAScript regexps are a copy of Perl 5's). I have built the AST of any ECMAScript source (cf. MarpaX::Languages::ECMA::AST), so now I was wondering about the actions associated with the grammar.
Re^7: Explain a regexp matched group result by ig (Vicar) on Oct 28, 2013 at 20:52 UTC
The note is interesting: they are highlighting this difference between ECMAScript REs and Perl REs. The RE `(x+)?` is very similar to `(x*)`, except that the latter will always match (and, therefore, never holds the value from a previous match if it is in an enclosing repeating group). This is similar to the requirement in Note 3: "Step 4 of the RepeatMatcher clears Atom's captures each time Atom is repeated." Because it always matches, it always has a value from the last repeat of the outer repeating group, as if it were reset for each repeat, except that the value is `''` instead of `undef` in the case that `x` did not match. This is an easy transformation. I appreciate that you don't want to change the RE, but you say you are parsing it, so perhaps you can make some systematic transformations. Consider:

    use strict;
    use warnings;
    use Data::Dumper::Concise;

    my $string = "aacbbbcac";
    my $re = '((a+)?(b+)?(c))*';
    # transform '(x+)?' to '(x*)' assuming 'x' is monolithic
    $re =~ s/\Q+)?/*)/g;
    print "re = $re\n";
    my $re1 = qr/$re/;
    if ($string =~ $re1) {
        my @something;
        foreach (0..$#-) {
            if(defined($-[$_])) {
                my $substring = substr($string, $-[$_], $+[$_] - $-[$_]);
                # ${$_} also works, except where $_ = 0
                no strict 'refs';
                print "\$substring = $substring = ${$_}\n";
                # transform '' to undef
                $substring = undef if($substring eq '');
                # assert: $substring is now as specified by
                # Standard ECMA-262, 5.1 Edition / June 2011
                # Section 15.10.2.5 Note 3
                printf "Group %d: <%s>\n", $_, $substring // '';
                $something[$_] = $substring;
            }
        }
        print Dumper(\@something);
    }

Produces:

    re = ((a*)(b*)(c))*
    $substring = aacbbbcac = test.pl
    Group 0: <aacbbbcac>
    $substring = ac = ac
    Group 1: <ac>
    $substring = a = a
    Group 2: <a>
    $substring =  = 
    Group 3: <>
    $substring = c = c
    Group 4: <c>
    [
      "aacbbbcac",
      "ac",
      "a",
      undef,
      "c"
    ]
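The two steps above (rewrite the RE, then map `''` to `undef`) could be wrapped up roughly as follows. This is a sketch under the same assumption as the post: every optional group has the exact form `(x+)?` with a monolithic `x`. The name `ecma_like_captures` is hypothetical, and the textual substitution is not safe for arbitrary REs.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the original post).
# Assumption: optional groups in $re_source look exactly like '(x+)?'.
sub ecma_like_captures {
    my ($string, $re_source) = @_;

    # Rewrite '(x+)?' to '(x*)' so the group takes part in every iteration
    # of an enclosing repeat and never keeps a value from an earlier one.
    (my $rewritten = $re_source) =~ s/\Q+)?\E/*)/g;

    return unless $string =~ /$rewritten/;

    # Copy the offset arrays right away; the next match overwrites @- and @+.
    my @start = @-;
    my @end   = @+;

    my @captures;
    for my $n (0 .. $#end) {
        my $text = defined $start[$n]
            ? substr($string, $start[$n], $end[$n] - $start[$n])
            : '';
        # An empty capture stands in for "did not match": report it as undef.
        $captures[$n] = $text eq '' ? undef : $text;
    }
    return @captures;
}

my @c = ecma_like_captures("aacbbbcac", '((a+)?(b+)?(c))*');
print join(', ', map { defined $_ ? "<$_>" : 'undef' } @c), "\n";
```

One limitation of this design: a group that legitimately matches the empty string cannot be told apart from one that did not match, so both come back as `undef`.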


	PerlMonks