http://qs321.pair.com?node_id=15867

swiftone has asked for the wisdom of the Perl Monks concerning the following question:

I'm writing a parser for a specified format (so I'm stuck with the format). I have no doubt this will lead to many questions, but here's my first:

Given a string of comma separated elements, where an element can contain a function, and functions can have commas in their arguments, how do I best grab the elements?

After looking over Merlyn's nested C comment parser and the CSV parser from Mastering Regular Expressions, I have a working solution. I'm not convinced, however, that this is the easiest or best way to do it. Comments?

#!/usr/bin/perl

$teststr="blah,blah(blah,blah(blah,blah(blah))),blah";
#This is three elements:
# blah
# blah(blah,blah(blah,blah(blah)))
# blah
# I don't have to worry about escaped parens, the file format forbids it.

foreach (&parse_comma($teststr)){
    print "$_\n"; #This just proves that it works
}

sub parse_comma{
    my $commastr=shift;
    my @tags;
    my $count=0;
    my $carrystr="";
    foreach (split(/,/, $commastr)){
        $_=$carrystr.",".$_ if $carrystr;
        $count=s/\(/(/g;
        $count-=s/\)/)/g;
        if($count){
            $carrystr=$_;
        }else{
            $carrystr="";
            push @tags, $_;
        }
    }
    return @tags;
}
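
For comparison, the same splitting can be sketched as a single pass over the characters, tracking paren depth instead of splitting on every comma and gluing pieces back together. This is only an illustration under the stated assumption that parens are never escaped; the name split_top_level is made up for this sketch and does not come from the post above.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: split on commas that sit outside all parentheses.
# Assumes parens are never escaped, as the file format is said to guarantee.
sub split_top_level {
    my ($str) = @_;
    my @elements;
    my $depth   = 0;     # current paren nesting level
    my $current = '';    # element being accumulated

    for my $ch (split //, $str) {
        if ($ch eq ',' && $depth == 0) {
            # A comma at depth 0 ends the current element.
            push @elements, $current;
            $current = '';
            next;
        }
        $depth++ if $ch eq '(';
        $depth-- if $ch eq ')';
        die "Unbalanced ')' in '$str'\n" if $depth < 0;
        $current .= $ch;
    }
    die "Unbalanced '(' in '$str'\n" if $depth > 0;

    push @elements, $current;    # last element has no trailing comma
    return @elements;
}

# Prints the three elements, one per line.
print "$_\n" for split_top_level("blah,blah(blah,blah(blah,blah(blah))),blah");
```

Unlike the split-and-rejoin version, this one dies on unbalanced input rather than silently dropping the unfinished element, which may or may not be what the real format calls for.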

Replies are listed 'Best First'.
Re: Balancing Parens
by lhoward (Vicar) on Jun 01, 2000 at 22:42 UTC
    Have you considered using Parse::RecDescent? It implements a full-featured recursive-descent parser. A real parser (as opposed to parsing a string with a regular expression alone) is much more powerful and can be more appropriate for parsing highly structured, nested data like yours. I'm not sure exactly what you want to do with the line after you parse it, so my example below doesn't do anything with the data it parses, but it should be a good starting point if you want to try using Parse::RecDescent to parse your data. (It has been a while since I've written a grammar, so it may look a bit rough.)

    use Parse::RecDescent;

    my $teststr="blah1,blah2(blah3,blah4(blah5,blah6(blah7))),blah8";

    my $grammar = q {
        content:   /[^\)\(\,]+/
        function:  content '(' list ')'
        value:     content
        item:      function | value
        list:      item ',' list | item
        startrule: list
    };

    my $parser = new Parse::RecDescent ($grammar) or die "Bad grammar!\n";
    defined $parser->startrule($teststr) or print "Bad text!\n";
      Simplifying the grammar, we get:

      use Parse::RecDescent;

      my $teststr="blah1,blah2(blah3,blah4(blah5,blah6(blah7))),blah8";

      my $grammar = q {
          list: <leftop: item ',' item>
          item: word '(' list ')' <commit> | word
          word: /\w+/
      };

      my $parser = new Parse::RecDescent ($grammar) or die "Bad grammar!\n";
      defined $parser->list($teststr) or print "Bad text!\n";

      -- Randal L. Schwartz, Perl hacker

      Thank you, this appears to be just what I was looking for. It may not be more efficient for this first part, but it looks like it can do 90% of the parsing (of the entire format, not just this one part) for me. I've never worked with yacc-like parsers, so this will be a new experiment for me. Once again, thanks!
        If you've never worked with parsers, swifie, check out the antipodean wizard Damian Conway's article in TPJ on Parse::RecDescent, entitled The man(1) of descent. At 13 pages, this must be the longest article ever in TPJ!
Re: Balancing Parens
by Anonymous Monk on Aug 17, 2000 at 10:12 UTC
    $_ = "blah,blah(blah,blah(blah,blah(blah))),blah";
    #$_="blah1,blah2(blah3,blah4(blah5,blah6(blah7))),blah8";
    ($re=$_)=~s/((\()|(\))|.)/$2\Q$1\E$3/gs;
    @$ = (eval{/$re/});
    die $@ if $@=~/unmatched/;
    $re = join'|',map{quotemeta}@$;
    print join"\n",/((?:$re|[^,])+)/g;
    
Re: Balancing Parens
by KM (Priest) on Jun 01, 2000 at 22:25 UTC
    Well, I don't know what the real data may look like, but this works for me with your $teststr:

    $teststr="blah,blah(blah,blah(blah,blah(blah))),blah";
    if ($teststr =~ /^(\w*),(.*?),(\w*)$/) {
        print "1: $1\n2: $2\n3: $3\n";
    }

    Cheers,
    KM

      Ah, I should have been more specific. The real data can have a variable number of elements. Thanks anyway.
        Well, be more specific. Show examples of the actual possible data, not pseudo-data that doesn't look like the real thing. Give us some test cases.

        Cheers,
        KM
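
On the variable-number-of-elements point: a fixed three-capture regex cannot cope, but a pattern that recurses into balanced parens can be applied repeatedly with /g. The following is a sketch only, assuming Perl 5.10 or later (for named recursion and possessive quantifiers), no escaped parens, and no empty elements; the sub name top_level_elements and the group name paren are invented for this example.

```perl
#!/usr/bin/perl
use 5.010;
use strict;
use warnings;

# Hypothetical helper: return every top-level comma-separated element,
# however many there are. Dies if the string cannot be consumed fully,
# for example because of unbalanced parens or an empty element.
sub top_level_elements {
    my ($str) = @_;
    my @out;

    while ($str =~ m{
        \G
        (                                 # capture 1: one whole element
          (?:
              [^(),]++                    # plain text: no parens, no commas
            |
              (?<paren>                   # a balanced parenthesised group
                \(
                (?: [^()]++ | (?&paren) )*+   # non-paren text or a nested group
                \)
              )
          )+
        )
        (?: , (?!\z) | \z )               # separator, or end of the string
    }xgc) {
        push @out, $1;
    }

    # With the c modifier, pos stays where the last successful match ended.
    die "Could not parse '$str'\n"
        unless (pos($str) // 0) == length $str;

    return @out;
}

print "$_\n" for top_level_elements("blah,blah(blah,blah(blah,blah(blah))),blah,blah(blah)");
```

Commas inside a parenthesised group are swallowed by the paren branch, so only top-level commas act as separators. Whether empty elements or whitespace around commas should be accepted depends on the real format, which this sketch does not know.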