Though you might be able to parse this with a fancy regex, a simple state machine would in my opinion be easier to read and less complicated to write and debug. If I understand your post correctly, your string has four tokens:
- quoted strings - opaque between the quotes
- runs of non-whitespace/non-close parenthesis that begin with anything but an open parenthesis or double quote, /[^)("][^)\s]*/.
- runs of whitespace used as a separator between the first two types of tokens
- parenthesized strings that may contain any of the first three types of tokens.
Assuming there are no parenthesized tokens within parenthesized tokens, you could use something like this:
use strict;
use warnings;
while (my $line = <DATA>) {
chomp $line;
# store tokens other than separators
my @aTokens;
# state: are we inside or outside of a parenthesized token?
my $bParen;
my $sInParens='';
while ($line =~ /("[^"]+"|\(|\)|[^)\s]+|\s+)/g) {
my $sToken = $1;
if ($sToken eq '(') {
#starting a parenthesized token
$bParen=1;
} elsif ($sToken eq ')') {
#ending a parenthesized token: add it to the list
$bParen=0;
push @aTokens, "($sInParens)";
$sInParens='';
} elsif ($bParen) {
# in the middle of a parenthesized token
$sInParens .= $sToken;
} elsif ($sToken =~ /^\S/) {
# not a parenthesized token
# either a quoted or unquoted non-whitespace token
# add it to the list
push @aTokens, $sToken;
}
}
local $"='> <';
printf "input : %s\n%s", "<$line>", "tokens: <@aTokens>";
}
__DATA__
xxx "()" ("charset" "ISO-8859-1") (")") "xxx"
If you also need parenthesized tokens within parenthesized tokens, they the loop is only slightly more complicated. You would need to change the flag $bParen to a counter that was incremented for each '(' and decremented for each ')' found. You would then build the token until $iParenCount returned to 0. Parentheses within quotes will have no effect on this count because the "[^"] run insures that only parentheses outside of quotes will get parsed into separate tokens:
use strict;
use warnings;
while (my $line = <DATA>) {
chomp $line;
my @aTokens;
my $sInParens='';
my $iParenCount;
while ($line =~ /("[^"]+"|\(|\)|[^)\s]*|\s+)/g) {
my $sToken = $1;
if ($sToken eq '(') {
if ($iParenCount) {
$sInParens .= $sToken;
}
$iParenCount++;
} elsif ($sToken eq ')') {
$iParenCount--;
if ($iParenCount) {
$sInParens .= $sToken;
} else {
push @aTokens, "($sInParens)";
$sInParens='';
}
} elsif ($iParenCount) {
$sInParens .= $sToken;
} elsif ($sToken =~ /^\S/) {
push @aTokens, $sToken;
}
}
local $"='> <';
print "paren count: $iParenCount\n";
printf "input : %s\n%s", "<$line>", "tokens: <@aTokens>\n";
}
__DATA__
xxx "()" ("charset" "ISO-8859-1") (")") "xxx" ((a)(b)(c)) yyy
Best, beth
Update: added some discussion about handling nested parenthesized tokens.
Update: Fixed overly greedy regex