PerlMonks  

Re: Extracting a parenthesized fragment from a string

by ELISHEVA (Prior)
on Sep 11, 2009 at 11:23 UTC (#794756)


in reply to Extracting a parenthesized fragment from a string

Though you might be able to parse this with a fancy regex, a simple state machine would, in my opinion, be easier to read and less complicated to write and debug. If I understand your post correctly, your string has four kinds of tokens:
  • quoted strings - opaque between the quotes
  • runs of non-whitespace/non-close parenthesis that begin with anything but an open parenthesis or double quote, /[^)("][^)\s]*/.
  • runs of whitespace used as a separator between the first two types of tokens
  • parenthesized strings that may contain any of the first three types of tokens.
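
To make the token list concrete, here is a minimal sketch that runs a tokenizing regex over the sample line from the question and labels each piece. The labels ('quoted', 'bare', 'space', 'open', 'close') and the variable names are only illustrative. The regex yields the first three token types plus the bare parentheses; the fourth type, the parenthesized string, is assembled from those pieces afterwards.

```perl
use strict;
use warnings;

# Sample line taken from the question; the labels are made up for this demo.
my $line = 'xxx "()" ("charset" "ISO-8859-1") (")") "xxx"';

my @pieces;
while ($line =~ /("[^"]+"|\(|\)|[^)\s]+|\s+)/g) {
    my $tok  = $1;
    my $kind = $tok =~ /^"/  ? 'quoted'   # opaque between the quotes
             : $tok eq '('   ? 'open'     # starts a parenthesized token
             : $tok eq ')'   ? 'close'    # ends a parenthesized token
             : $tok =~ /^\s/ ? 'space'    # separator
             :                 'bare';    # unquoted run of non-whitespace
    push @pieces, [ $kind, $tok ];
}

# One piece per line, kind first.
printf "%-6s <%s>\n", @$_ for @pieces;
```

Because the quoted alternative comes first in the regex, the parentheses inside "()" and ")" are swallowed as part of a quoted piece and never show up as 'open' or 'close'.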

Assuming there are no parenthesized tokens within parenthesized tokens, you could use something like this:

use strict;
use warnings;

while (my $line = <DATA>) {
    chomp $line;

    # store tokens other than separators
    my @aTokens;

    # state: are we inside or outside of a parenthesized token?
    my $bParen;
    my $sInParens = '';

    while ($line =~ /("[^"]+"|\(|\)|[^)\s]+|\s+)/g) {
        my $sToken = $1;
        if ($sToken eq '(') {
            # starting a parenthesized token
            $bParen = 1;
        } elsif ($sToken eq ')') {
            # ending a parenthesized token: add it to the list
            $bParen = 0;
            push @aTokens, "($sInParens)";
            $sInParens = '';
        } elsif ($bParen) {
            # in the middle of a parenthesized token
            $sInParens .= $sToken;
        } elsif ($sToken =~ /^\S/) {
            # not a parenthesized token:
            # either a quoted or unquoted non-whitespace token
            # add it to the list
            push @aTokens, $sToken;
        }
    }

    local $" = '> <';
    printf "input : %s\n%s", "<$line>", "tokens: <@aTokens>\n";
}

__DATA__
xxx "()" ("charset" "ISO-8859-1") (")") "xxx"

If you also need parenthesized tokens within parenthesized tokens, then the loop is only slightly more complicated. Change the flag $bParen to a counter, $iParenCount, that is incremented for each '(' and decremented for each ')' found, and keep building the token until $iParenCount returns to 0. Parentheses within quotes have no effect on this count because the "[^"]+" alternative is tried first, which ensures that only parentheses outside of quotes get parsed into separate tokens:

use strict;
use warnings;

while (my $line = <DATA>) {
    chomp $line;

    my @aTokens;
    my $sInParens = '';

    # nesting depth: 0 means we are outside of any parenthesized token
    my $iParenCount = 0;

    # note: [^)\s]+ rather than [^)\s]* so the regex never matches an empty token
    while ($line =~ /("[^"]+"|\(|\)|[^)\s]+|\s+)/g) {
        my $sToken = $1;
        if ($sToken eq '(') {
            # an inner '(' is kept as part of the enclosing token
            if ($iParenCount) {
                $sInParens .= $sToken;
            }
            $iParenCount++;
        } elsif ($sToken eq ')') {
            $iParenCount--;
            if ($iParenCount) {
                # an inner ')' is kept as part of the enclosing token
                $sInParens .= $sToken;
            } else {
                # outermost ')' : the token is complete
                push @aTokens, "($sInParens)";
                $sInParens = '';
            }
        } elsif ($iParenCount) {
            $sInParens .= $sToken;
        } elsif ($sToken =~ /^\S/) {
            push @aTokens, $sToken;
        }
    }

    local $" = '> <';
    print "paren count: $iParenCount\n";
    printf "input : %s\n%s", "<$line>", "tokens: <@aTokens>\n";
}

__DATA__
xxx "()" ("charset" "ISO-8859-1") (")") "xxx" ((a)(b)(c)) yyy
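
The counter approach can also be packaged as a function, which makes it easier to drop into existing code. The following is only a sketch: the name tokenize and the decision to die on unbalanced input are my assumptions, not something the original question asked for. It relies on the same observation as the "paren count" diagnostic above: a count that dips below zero or ends above zero means the parentheses did not balance.

```perl
use strict;
use warnings;

# tokenize() is a hypothetical helper name, not part of the original post.
# Returns the list of tokens; dies if the parentheses do not balance.
sub tokenize {
    my ($line) = @_;
    my @tokens;
    my $in_parens = '';
    my $depth     = 0;    # current nesting depth

    while ($line =~ /("[^"]+"|\(|\)|[^)\s]+|\s+)/g) {
        my $tok = $1;
        if ($tok eq '(') {
            $in_parens .= $tok if $depth;    # keep inner parens verbatim
            $depth++;
        }
        elsif ($tok eq ')') {
            die "unmatched ')' in <$line>\n" if $depth == 0;
            $depth--;
            if ($depth) { $in_parens .= $tok }
            else        { push @tokens, "($in_parens)"; $in_parens = '' }
        }
        elsif ($depth)        { $in_parens .= $tok }    # inside parentheses
        elsif ($tok =~ /^\S/) { push @tokens, $tok }    # top-level token
    }
    die "unclosed '(' in <$line>\n" if $depth;
    return @tokens;
}

my @tokens = tokenize('xxx "()" ((a)(b)(c)) yyy');
print "<$_>\n" for @tokens;

# Unbalanced input is reported instead of silently producing odd tokens.
my $ok = eval { tokenize('xxx (a (b)'); 1 };
print "error: $@" unless $ok;
```

Whether to die, warn, or return a partial result on unbalanced input depends on how forgiving the surrounding code needs to be; die is used here only because it is the shortest way to show the check.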

Best, beth

Update: added some discussion about handling nested parenthesized tokens.

Update: Fixed overly greedy regex

Re^2: Extracting a parenthesized fragment from a string
by fce2 (Sexton) on Sep 11, 2009 at 12:42 UTC

    I very much like your analysis and approach to this. I'd spent so long trying to twist this regex to my will that I'd forgotten that there's other tools in the shed!

    I'll take this and fit it into my code and see how it goes. Thanks so much!
