http://qs321.pair.com?node_id=794756


in reply to Extracting a parenthesized fragment from a string

Though you might be able to parse this with a fancy regex, a simple state machine would in my opinion be easier to read and less complicated to write and debug. If I understand your post correctly, your string has four tokens:

Assuming there are no parenthesized tokens within parenthesized tokens, you could use something like this:

use strict;
use warnings;

while (my $line = <DATA>) {
    chomp $line;

    # store tokens other than separators
    my @aTokens;

    # state: are we inside or outside of a parenthesized token?
    my $bParen    = 0;
    my $sInParens = '';

    # bare words stop at either parenthesis, so foo(bar) still splits cleanly
    while ($line =~ /("[^"]+"|\(|\)|[^()\s]+|\s+)/g) {
        my $sToken = $1;
        if ($sToken eq '(') {
            # starting a parenthesized token
            $bParen = 1;
        } elsif ($sToken eq ')') {
            # ending a parenthesized token: add it to the list
            $bParen = 0;
            push @aTokens, "($sInParens)";
            $sInParens = '';
        } elsif ($bParen) {
            # in the middle of a parenthesized token
            $sInParens .= $sToken;
        } elsif ($sToken =~ /^\S/) {
            # not a parenthesized token:
            # either a quoted or unquoted non-whitespace token
            # add it to the list
            push @aTokens, $sToken;
        }
    }

    local $" = '> <';
    printf "input : %s\n%s", "<$line>", "tokens: <@aTokens>\n";
}

__DATA__
xxx "()" ("charset" "ISO-8859-1") (")") "xxx"
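To see what the state machine has to work with, it can help to look at the raw token stream first. The following is only a sketch: raw_tokens is a hypothetical helper name of mine, and it uses the alternation from the loop above, written so that bare words stop at either parenthesis. In list context, a /g match with one capture group returns every captured token, whitespace runs included.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of the original post): return the raw
# lexer tokens, whitespace included, before any grouping is done.
sub raw_tokens {
    my ($line) = @_;
    # Alternatives are tried left to right: quoted string, '(', ')',
    # bare word, whitespace run.
    return $line =~ /("[^"]+"|\(|\)|[^()\s]+|\s+)/g;
}

my @raw = raw_tokens('xxx "()" (")")');
print join('|', @raw), "\n";    # prints: xxx| |"()"| |(|")"|)
```

Note that the parentheses inside "()" and ")" never show up as tokens of their own, because the quoted-string alternative swallows them whole.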

If you also need parenthesized tokens within parenthesized tokens, then the loop is only slightly more complicated. Change the flag $bParen to a counter, $iParenCount, that is incremented for each '(' and decremented for each ')' found, and keep building the token until $iParenCount returns to 0. Parentheses within quotes have no effect on this count: the "[^"]+" alternative is tried first and consumes a quoted string as a single token, which ensures that only parentheses outside of quotes get parsed into separate tokens:

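use strict;
use warnings;

while (my $line = <DATA>) {
    chomp $line;

    my @aTokens;
    my $sInParens   = '';
    my $iParenCount = 0;    # current nesting depth

    # '+' (not '*') on the bare-word class, so no empty tokens are produced
    while ($line =~ /("[^"]+"|\(|\)|[^()\s]+|\s+)/g) {
        my $sToken = $1;
        if ($sToken eq '(') {
            # an inner '(' is part of the token being built
            if ($iParenCount) {
                $sInParens .= $sToken;
            }
            $iParenCount++;
        } elsif ($sToken eq ')') {
            $iParenCount--;
            if ($iParenCount) {
                # an inner ')' is part of the token being built
                $sInParens .= $sToken;
            } else {
                # back at depth 0: the outermost token is complete
                push @aTokens, "($sInParens)";
                $sInParens = '';
            }
        } elsif ($iParenCount) {
            # in the middle of a parenthesized token
            $sInParens .= $sToken;
        } elsif ($sToken =~ /^\S/) {
            # top-level quoted or unquoted token
            push @aTokens, $sToken;
        }
    }

    local $" = '> <';
    # a non-zero count here means the parentheses did not balance
    print "paren count: $iParenCount\n";
    printf "input : %s\n%s", "<$line>", "tokens: <@aTokens>\n";
}

__DATA__
xxx "()" ("charset" "ISO-8859-1") (")") "xxx" ((a)(b)(c)) yyy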
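If the tokenizer has to live inside a larger program, the same counter approach could be wrapped in a sub. This is only a sketch under my own assumptions: the name tokenize_nested, the plain variable names, and the choice to die on unbalanced parentheses are mine and not something the original question asked for.

```perl
use strict;
use warnings;

# Hypothetical helper: tokenize one line, grouping nested parentheses,
# and die if the parentheses do not balance (an assumed policy).
sub tokenize_nested {
    my ($line) = @_;
    my @tokens;
    my $in_parens = '';
    my $depth     = 0;

    while ($line =~ /("[^"]+"|\(|\)|[^()\s]+|\s+)/g) {
        my $token = $1;
        if ($token eq '(') {
            $in_parens .= $token if $depth;    # inner '(' is kept verbatim
            $depth++;
        }
        elsif ($token eq ')') {
            die "unmatched ')' in: $line\n" if $depth == 0;
            $depth--;
            if ($depth) {
                $in_parens .= $token;          # inner ')' is kept verbatim
            }
            else {
                push @tokens, "($in_parens)";  # outermost group is complete
                $in_parens = '';
            }
        }
        elsif ($depth) {
            $in_parens .= $token;              # anything inside a group
        }
        elsif ($token =~ /^\S/) {
            push @tokens, $token;              # top-level word or quoted string
        }
    }
    die "unclosed '(' in: $line\n" if $depth;
    return @tokens;
}

# Prints four tokens, one per line: ((a)(b)(c)), yyy, "(" and z
print "<$_>\n" for tokenize_nested('((a)(b)(c)) yyy "(" z');

# An unbalanced line is reported instead of being silently mis-tokenized.
my $ok = eval { tokenize_nested('(a (b)'); 1 };
print "error: $@" unless $ok;
```

Whether dying is the right reaction depends on where the strings come from; returning the leftover depth, as the print of the paren count does in the script above, is an equally reasonable choice.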

Best, beth

Update: added some discussion about handling nested parenthesized tokens.

Update: Fixed overly greedy regex

Re^2: Extracting a parenthesized fragment from a string
by fce2 (Sexton) on Sep 11, 2009 at 12:42 UTC

    I very much like your analysis and approach to this. I'd spent so long trying to twist this regex to my will that I'd forgotten that there are other tools in the shed!

    I'll take this and fit it into my code and see how it goes. Thanks so much!