Re: Extracting a parenthesized fragment from a string

Though you might be able to parse this with a fancy regex, a simple state machine would in my opinion be easier to read and less complicated to write and debug. If I understand your post correctly, your string has four tokens:

quoted strings - opaque between the quotes
runs of non-whitespace/non-close parenthesis that begin with anything but an open parenthesis or double quote, /[^)("][^)\s]*/.
runs of whitespace used as a separator between the first two types of tokens
parenthesized strings that may contain any of the first three types of tokens.

Assuming there are no parenthesized tokens within parenthesized tokens, you could use something like this:

use strict;
use warnings;

while (my $line = <DATA>) {
  chomp $line;

  # store tokens other than separators
  my @aTokens;

  # state: are we inside or outside of a parenthesized token?
  my $bParen;
  my $sInParens='';

  while ($line =~ /("[^"]+"|\(|\)|[^)\s]+|\s+)/g) {
    my $sToken = $1;
    if ($sToken eq '(') {
      #starting a parenthesized token
      $bParen=1;
    } elsif ($sToken eq ')') {
      #ending a parenthesized token: add it to the list
      $bParen=0;
      push @aTokens, "($sInParens)";
      $sInParens='';
    } elsif ($bParen) {
      # in the middle of a parenthesized token
      $sInParens .= $sToken;
    } elsif ($sToken =~ /^\S/) {
      # not a parenthesized token
      # either a quoted or unquoted non-whitespace token
      # add it to the list
      push @aTokens, $sToken;
    }
  }
  local $"='> <';
  printf "input : %s\n%s", "<$line>", "tokens: <@aTokens>";
}

__DATA__
xxx "()" ("charset" "ISO-8859-1") (")") "xxx"
[download]

If you also need parenthesized tokens within parenthesized tokens, they the loop is only slightly more complicated. You would need to change the flag $bParen to a counter that was incremented for each '(' and decremented for each ')' found. You would then build the token until $iParenCount returned to 0. Parentheses within quotes will have no effect on this count because the "[^"] run insures that only parentheses outside of quotes will get parsed into separate tokens:

use strict;
use warnings;

while (my $line = <DATA>) {
  chomp $line;
  my @aTokens;
  my $sInParens='';
  my $iParenCount;

  while ($line =~ /("[^"]+"|\(|\)|[^)\s]*|\s+)/g) {
    my $sToken = $1;
    if ($sToken eq '(') {
      if ($iParenCount) {
        $sInParens .= $sToken;
      }
      $iParenCount++;
    } elsif ($sToken eq ')') {
      $iParenCount--;
      if ($iParenCount) {
        $sInParens .= $sToken;
      } else {
        push @aTokens, "($sInParens)";
        $sInParens='';
      }
    } elsif ($iParenCount) {
      $sInParens .= $sToken;
    } elsif ($sToken =~ /^\S/) {
      push @aTokens, $sToken;
    }
  }
  local $"='> <';
  print "paren count: $iParenCount\n";
  printf "input : %s\n%s", "<$line>", "tokens: <@aTokens>\n";
}

__DATA__
xxx "()" ("charset" "ISO-8859-1") (")") "xxx" ((a)(b)(c)) yyy
[download]

Best, beth

Update: added some discussion about handling nested parenthesized tokens.

Update: Fixed overly greedy regex

Comment on Re: Extracting a parenthesized fragment from a string Select or Download Code


We don't bite newbies here... much
	PerlMonks