clueless newbie has asked for the wisdom of the Perl Monks concerning the following question:
I've taken this from Conway's "Everything You Know About Regexes Is Wrong", but I can't get it to behave.
#!/usr/bin/env perl
use 5.01800;
use warnings;
my @cases=(
"<1>"
,"<1,2>"
,"<1,2,<3,4>,5,6>"
);
my $re=qr{(?x) (?&LIST)
(?(DEFINE)
(?<LIST> < (?&ITEM) (?: , (?&ITEM))*+ > )
(?<ITEM> \d*+ | (?&LIST) )
)
};
for my $case (@cases) {
say qq{$case\n},
$case =~ qr{$re} ? "matches: '$&'" : "doesn't match";
};
__END__
which yields
perl Conway_01.pl
<1>
matches! <1>
<1,2>
matches! <1,2>
<1,2,<3,4>,5,6>
matches! <3,4> !!!! shouldn't this be <1,2,<3,4>,5,6>
What have I got wrong?
Re: regex for nested "<"/">'
by TheDamian (Vicar) on Feb 12, 2020 at 09:31 UTC
|
What have I got wrong?
You mistranscribed that \d*+.
It should be \d++.
(I checked back through all of my notes and presentations,
and its definitely shown as \d++ in them.)
The nested match goes wrong when you use \d*+,
because when it encounters a nested < the
\d*+ successfully matches zero times.
And, because the \d*+ | (?&LIST) is itself in
a non-backtracking loop: (\d*+ | (?&LIST) )*+,
when the zero-length submatch causes the main match to fail, the regex engine
can't backtrack into the alternative and try the (?&LIST) instead.
So the original match from the start of the string fails, and
the regex engine skips down the string, trying again and again,
until it finds the nested sublist, which it is able to match.
Incidentally, you could have watched this happen live and in colour
via the Regexp::Debugger module. Just download it from CPAN and
add it in front of your regex:
use Regexp::Debugger;
my $re=qr{(?x) (?&LIST)
(?(DEFINE)
(?<LIST> < (?&ITEM) (?: , (?&ITEM))*+ > )
(?<ITEM> \d*+ | (?&LIST) )
)
};
Damian | [reply] [d/l] [select] |
|
| [reply] [d/l] |
Re: regex for nested "<"/">'
by choroba (Cardinal) on Feb 11, 2020 at 21:24 UTC
|
Sometimes, a parser is a better tool than a regex. I like Marpa::R2:
#!/usr/bin/perl
use warnings;
use strict;
use feature qw{ say };
use Marpa::R2;
my $dsl = << '__DSL__';
lexeme default = latm => 1
:default ::= action => ::first
List ::= ('<') Items ('>')
Items ::= Item action => list
| Item (',') Items action => merge
Item ::= List
| digits
digits ~ [\d]+
__DSL__
sub list { [ $_[1] ] }
sub merge { [ $_[1], @{ $_[2] } ] }
my $grammar = 'Marpa::R2::Scanless::G'->new({ source => \$dsl });
my @cases=('<1>',
'<<1>>',
'<1,2>',
'<1,2,<3,4>,5,6>',
'<<<1,2>,3>,<4,5,<6>>,<<7>,8,9>>');
for my $case (@cases) {
my $value_ref = $grammar->parse(\$case, 'main');
use Data::Dumper; print Dumper $$value_ref;
}
map{substr$_->[0],$_->[1]||0,1}[\*||{},3],[[]],[ref qr-1,-,-1],[{}],[sub{}^*ARGV,3]
| [reply] [d/l] [select] |
Re: regex for nested "<"/">'
by haukex (Archbishop) on Feb 11, 2020 at 20:51 UTC
|
In the regex, changing \d*+ to \d++ fixes it, although I have to admit I don't see why yet, it'll take a bit more staring at it to figure that out... Anyway:
WebPerl Regex Tester Link
Personally, I'd use Regexp::Common for this kind of task:
use warnings;
use strict;
use Regexp::Common 2013030901 qw/balanced/;
my $re = qr/$RE{balanced}{-parens=>'<>'}/;
print $re, "\n";
my @cases = ( "<1>" ,"<1,2>" ,"<1,2,<3,4>,5,6>" );
for my $case (@cases) {
print $case, " ",
$case =~ $re ? "matches: '$1'\n" : "doesn't match\n";
}
__END__
(?^:((?:\<(?:(?>[^\<\>]+)|(?-1))*\>)))
<1> matches: '<1>'
<1,2> matches: '<1,2>'
<1,2,<3,4>,5,6> matches: '<1,2,<3,4>,5,6>'
| [reply] [d/l] [select] |
Re: regex for nested "<"/">'
by clueless newbie (Curate) on Feb 12, 2020 at 15:12 UTC
|
| [reply] |
|
For balanced begin/end matching, this might do:
$re = qr{
(
(?:
(?&OPEN)
(?:
(?>(?:.*?(?=(?&OPEN)|(?&CLOSE))))
|
(?-1)
)*
(?&CLOSE)
)
)
(?(DEFINE)
(?<OPEN>\bBEGIN\b)
(?<CLOSE>\bEND\b)
)}x;
| [reply] [d/l] |
|
| [reply] |
|
#!/usr/bin/perl
use strict; # https://perlmonks.org/?node_id=11112811
use warnings;
my $input = <<ENDOFSTRING;
beginning a bend
before stuff
begin
one
begin
three
end
five
end
after stuff
beginning a bend
ENDOFSTRING
print $input, '-' x 70, "\n";
my $level = 0;
my $nested = $input =~ s/(\bbegin\b)|(\bend\b)|./
( $1 && $level++, $level ? $& : '', $2 && $level--)[1] /gesr;
print "$nested\n";
Outputs:
beginning a bend
before stuff
begin
one
begin
three
end
five
end
after stuff
beginning a bend
----------------------------------------------------------------------
begin
one
begin
three
end
five
end
| [reply] [d/l] [select] |
|
|