Regex to match text in broken parens

Rodster001 has asked for the wisdom of the Perl Monks concerning the following question:

Re: Regex to match text in broken parens by choroba (Cardinal) on Oct 31, 2014 at 18:53 UTC
The following code passes your tests. It uses a single `if` and a hash ref as a "poor man's switch":

    for (@test) {
        if (/ .* (^|\() (.*?) (\)|$) /x and my $p = "$1$3") {
            print ' Match ',
                  { ')'  => 'before right paren',
                    '('  => 'after left paren',
                    '()' => 'in parens',
                  }->{$p},
                  ": $2\n";
        }
    }

It's not clear, though, what should happen if the parentheses were nested.

لսႽ† ᥲᥒ⚪⟊Ⴙᘓᖇ Ꮅᘓᖇ⎱ Ⴙᥲ𝇋ƙᘓᖇ
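For readers who want to run the idea on its own, here is a minimal sketch. It assumes the escaped parentheses in the pattern read `\(` and `\)`. The sub name `classify` and the `%label` hash are hypothetical, and the sample strings are borrowed from davido's reply further down the thread.

```perl
use strict;
use warnings;

# Hypothetical helper: returns (label, text) on a match, or an empty
# list when the string contains no parenthesis at all.
sub classify {
    my ($string) = @_;
    my %label = (
        '()' => 'in parens',
        '('  => 'after left paren',
        ')'  => 'before right paren',
    );
    # $1 is '(' or empty (start of string); $3 is ')' or empty (end of
    # string). Their concatenation is empty, hence false, when neither
    # paren is present.
    if ($string =~ / .* (^|\() (.*?) (\)|$) /x and my $p = "$1$3") {
        return ($label{$p}, $2);
    }
    return;
}

# Sample strings borrowed from davido's reply in this thread.
for my $string (
    '1 This (is a test) with good parens',
    '2 This is a (test with broken a paren',
    '3 And this would be one) the other way',
    '4 Lastly, no parens',
) {
    my ($label, $text) = classify($string);
    print defined $label ? "Match $label: $text\n" : "No match: $string\n";
}
```

On nested input the leading greedy `.*` makes this pattern pick the rightmost opening paren, so `classify('a (b (c) d) e')` reports `c` as being "in parens". Whether that is the desired answer is exactly the open question about nesting.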
Re^2: Regex to match text in broken parens by Rodster001 (Pilgrim) on Oct 31, 2014 at 19:25 UTC
Exactly what I was looking for, thanks! (btw, nested parens won't be a problem with the data I am working with)
Re^3: Regex to match text in broken parens by ww (Archbishop) on Oct 31, 2014 at 21:21 UTC
Famous last words! `++$anecdote ne $data`
Re: Regex to match text in broken parens by davido (Cardinal) on Nov 01, 2014 at 07:38 UTC
I really don't mind using more than one regex for this one. You're dealing with more than one rule, so there's a nice symmetry: each rule has corresponding code. If you are concerned about it being verbose where you want it to be terse, move the work out to a subroutine. Anyway, with those ideas, here's my version:

    use strict;
    use warnings;
    use Test::More;

    my @test = (
        [ '1 This (is a test) with good parens'    => 'is a test',                'Match in parens' ],
        [ '2 This is a (test with broken a paren'  => 'test with broken a paren', 'Match after left paren' ],
        [ '3 And this would be one) the other way' => '3 And this would be one',  'Match before right paren' ],
        [ '4 Lastly, no parens'                    => '',                         'No match' ],
    );

    foreach my $test (@test) {
        my $got = match( $test->[0] );
        is( $got, $test->[1], "$test->[2]: <<$got>>" );
    }

    done_testing();

    sub match {
        for (shift) {
            m/ \(([^)]*?)\) /x && return $1;    # Both parens.
            m/ \((.*)$      /x && return $1;    # Left paren.
            m/ ^(.*)\)      /x && return $1;    # Right paren.
            m/ ^[^()]*()$   /x && return $1;    # No parens (no capture).
            return;                             # Unreachable.
        }
    }

Update: As often happens, I just have to go to bed to have an idea disturb me. Here's an improvement (I think) on sub match:

    sub match {
        local $_ = shift;
        m/ \(([^)]*?)\) /x          # Both parens.
            || m/ \((.*)$    /x     # Left paren.
            || m/ ^(.*)\)    /x     # Right paren.
            || m/ ^[^()]*()$ /x;    # No parens (no capture).
        return $1 // ();
    }

Here's another version that combines the logic above into a single regex using alternation. I don't necessarily think this is better; I prefer the simplicity of breaking things into smaller regexes.

    sub match {
        shift =~ m/
              (?: [^(]*\((?<C>[^)]*?)\) )   # Both parens.
            | (?: \((?<C>.*)$           )   # Left paren.
            | (?: ^(?<C>.*)\)           )   # Right paren.
            | (?: ^[^()]*(?<C>)$        )   # No parens (empty capture).
        /x;
        return $+{C} // ();
    }

By using named captures we avoid the problem where other single-regex solutions result in either `$1`, or `$2`, or `$3` being populated. That's too much to keep track of, and could be error prone.
Instead, we give every capture the same name and read it from `$+{C}`. (Warning: After checking perlre, I'm of the vague and uncertain impression that this could rely on undefined behavior.)

Update: Having a little fun with this. Here are two more options with subtle changes from the previous.

The next example eliminates named captures. This would present a problem: the numeric match variable that accepts the capture could be `$1`, `$2`, or `$3`. choroba avoids this issue by concatenating all possible numeric match variables, but that means possibly interpolating undef, and feels a little dirty (but it is clever). We can avoid that by using `$^N`, which holds the text of the most recently closed capture group.

    sub match {
        shift =~ m/
              (?: [^(]*\(([^)]*?)\) )   # Both parens.
            | (?: \((.*)$           )   # Left paren.
            | (?: ^(.*)\)           )   # Right paren.
            | (?: ^[^()]*()$        )   # No parens (empty capture).
        /x;
        return $^N // ();
    }

This next one wraps all the alternation branches in the `(?|...)` branch reset construct. That means that each alternative will use the same `$1`, which is actually the closest I can come to the multiple-regex solutions I originally presented, but within a single regex.

    sub match {
        shift =~ m/
            (?|
                  (?: [^(]*\(([^)]*?)\) )   # Both parens.
                | (?: \((.*)$           )   # Left paren.
                | (?: ^(.*)\)           )   # Right paren.
                | (?: ^[^()]*()$        )   # No parens (empty capture).
            )
        /x;
        return $1 // ();
    }

And finally we can remove the grouping `(?:...)` parens, because alternation is already very low precedence:

    sub match {
        shift =~ m/
            (?|
                  [^(]*\(([^)]*?)\)   # Both parens.
                | \((.*)$             # Left paren.
                | ^(.*)\)             # Right paren.
                | ^[^()]*()$          # No parens (empty capture).
            )
        /x;
        return $1 // ();
    }

I think that this, being Perl, grants us license to explore in the spirit of There is more than one way to do it. :)

Dave
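To make the numbering issue concrete, here is a small sketch (not part of Dave's post) that runs the same branches with and without branch reset. It assumes the patterns read as reconstructed in this thread, with `\(`, `\)` and `*` restored. The sub names `group_number` and `branch_reset_capture` are hypothetical, and the "no parens" branch is left out for brevity.

```perl
use strict;
use warnings;

# Hypothetical helper: plain alternation, so every branch owns its own
# group number. Returns (group number, captured text) or an empty list.
sub group_number {
    my ($string) = @_;
    $string =~ m/
          [^(]*\(([^)]*?)\)   # Group 1: both parens.
        | \((.*)$             # Group 2: left paren only.
        | ^(.*)\)             # Group 3: right paren only.
    /x or return;
    my @captures = ($1, $2, $3);
    # Only the group in the branch that matched is defined.
    my ($n) = grep { defined $captures[ $_ - 1 ] } 1 .. 3;
    return ($n, $captures[ $n - 1 ]);
}

# Hypothetical helper: the same branches inside (?|...), so whichever
# branch matches, its capture is group 1.
sub branch_reset_capture {
    my ($string) = @_;
    $string =~ m/
        (?|
              [^(]*\(([^)]*?)\)
            | \((.*)$
            | ^(.*)\)
        )
    /x or return;
    return $1;
}

for my $string (
    '1 This (is a test) with good parens',
    '2 This is a (test with broken a paren',
    '3 And this would be one) the other way',
) {
    my ($n, $text) = group_number($string);
    printf "plain: \$%d = <<%s>>  branch reset: \$1 = <<%s>>\n",
        $n, $text, branch_reset_capture($string);
}
```

With plain alternation the caller has to find out which of `$1`, `$2`, `$3` was set. With branch reset the capture is always in `$1`, which is why the last two versions of `match` can simply return `$1`.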


	PerlMonks