Though you might be able to parse this with a fancy regex, a simple state machine would in my opinion be easier to read and less complicated to write and debug. If I understand your post correctly, your string has four tokens:
- quoted strings - opaque between the quotes
- runs of non-whitespace/non-close parenthesis that begin with anything but an open parenthesis or double quote, /[^)("][^)\s]*/.
- runs of whitespace used as a separator between the first two types of tokens
- parenthesized strings that may contain any of the first three types of tokens.
Assuming there are no parenthesized tokens within parenthesized tokens, you could use something like this:
use strict;
use warnings;
while (my $line = <DATA>) {
chomp $line;
# store tokens other than separators
my @aTokens;
# state: are we inside or outside of a parenthesized token?
my $bParen;
my $sInParens='';
while ($line =~ /("[^"]+"|\(|\)|[^)\s]+|\s+)/g) {
my $sToken = $1;
if ($sToken eq '(') {
#starting a parenthesized token
$bParen=1;
} elsif ($sToken eq ')') {
#ending a parenthesized token: add it to the list
$bParen=0;
push @aTokens, "($sInParens)";
$sInParens='';
} elsif ($bParen) {
# in the middle of a parenthesized token
$sInParens .= $sToken;
} elsif ($sToken =~ /^\S/) {
# not a parenthesized token
# either a quoted or unquoted non-whitespace token
# add it to the list
push @aTokens, $sToken;
}
}
local $"='> <';
printf "input : %s\n%s", "<$line>", "tokens: <@aTokens>";
}
__DATA__
xxx "()" ("charset" "ISO-8859-1") (")") "xxx"
If you also need parenthesized tokens within parenthesized tokens, they the loop is only slightly more complicated. You would need to change the flag $bParen to a counter that was incremented for each '(' and decremented for each ')' found. You would then build the token until $iParenCount returned to 0. Parentheses within quotes will have no effect on this count because the "[^"] run insures that only parentheses outside of quotes will get parsed into separate tokens:
use strict;
use warnings;
while (my $line = <DATA>) {
chomp $line;
my @aTokens;
my $sInParens='';
my $iParenCount;
while ($line =~ /("[^"]+"|\(|\)|[^)\s]*|\s+)/g) {
my $sToken = $1;
if ($sToken eq '(') {
if ($iParenCount) {
$sInParens .= $sToken;
}
$iParenCount++;
} elsif ($sToken eq ')') {
$iParenCount--;
if ($iParenCount) {
$sInParens .= $sToken;
} else {
push @aTokens, "($sInParens)";
$sInParens='';
}
} elsif ($iParenCount) {
$sInParens .= $sToken;
} elsif ($sToken =~ /^\S/) {
push @aTokens, $sToken;
}
}
local $"='> <';
print "paren count: $iParenCount\n";
printf "input : %s\n%s", "<$line>", "tokens: <@aTokens>\n";
}
__DATA__
xxx "()" ("charset" "ISO-8859-1") (")") "xxx" ((a)(b)(c)) yyy
Best, beth
Update: added some discussion about handling nested parenthesized tokens.
Update: Fixed overly greedy regex
-
Are you posting in the right place? Check out Where do I post X? to know for sure.
-
Posts may use any of the Perl Monks Approved HTML tags. Currently these include the following:
<code> <a> <b> <big>
<blockquote> <br /> <dd>
<dl> <dt> <em> <font>
<h1> <h2> <h3> <h4>
<h5> <h6> <hr /> <i>
<li> <nbsp> <ol> <p>
<small> <strike> <strong>
<sub> <sup> <table>
<td> <th> <tr> <tt>
<u> <ul>
-
Snippets of code should be wrapped in
<code> tags not
<pre> tags. In fact, <pre>
tags should generally be avoided. If they must
be used, extreme care should be
taken to ensure that their contents do not
have long lines (<70 chars), in order to prevent
horizontal scrolling (and possible janitor
intervention).
-
Want more info? How to link
or How to display code and escape characters
are good places to start.
|