Beefy Boxes and Bandwidth Generously Provided by pair Networks
Just another Perl shrine
 
PerlMonks  

Parsing balanced parentheses

by budman (Sexton)
on Jul 23, 2006 at 03:04 UTC ( [id://563072]=perlquestion: print w/replies, xml ) Need Help??

budman has asked for the wisdom of the Perl Monks concerning the following question:

Hi, I am wondering if I am heading in the right direction. I am trying to parse a file that looks similar to blocks {} in perl. Example:
my $contents = qq( create item "xxx" { remove entry "xxx" add entry "yy" "xxx" { item1=xxxx, item2=xxxx, } add diffenent entry "xxx" { item1=xxx, item2=xxxx, item3=xxx add subitem { item1=xxx, item2=xxx, item3=xxx } add subitem { item1=xxx, item2=xxx, } } another type { item1=xxx } with "xxx" } );
This is a bit tricky with all the differnt blocks. I started by splitting on newline into an array, then iterating the lines.
# I remove the C-style comments prior to this my @lines = split(/\n/,$list); my $level = 0; for (@lines) { print "$level:$_\n"; $level++ if /\{/; $level-- if /\}/; }
I remember doing something like this with recursion function when traversing directories. Any ideas what to try? Thanks.

2006-07-23 Retitled by g0n, as per Monastery guidelines
Original title: 'parsing'

Replies are listed 'Best First'.
Re: Parsing balanced parentheses
by GrandFather (Saint) on Jul 23, 2006 at 04:05 UTC

    Seems there are a couple of ways of going about this depending on how much of the whole parsing problem this is. One way is to hand roll a recursive decent parser. Another way is to use Parse::RecDescent to generate the parser for you. If the data you have shown is representative of the worst case problem then hand rolling the parser is probably a reasonable approach. If it gets much more complex than that then using the module to generate a parser is probably a good idea.

    You might like to tell us something of the larger picture in case that hints at better ways to solve the larger problem rather than the fine focus on the parsing problem. The best approach to solving the parsing issue may well depend on what you want to do with the parsed result.

    Update: as an exercise I wrote the following code that may be a starting point for rolling parser:

    use strict; use warnings; use Data::Dump::Streamer; my $contents = <<'BLOCK'; create item "xxx" { remove entry "xxx" add entry "yy" "xxx" { item1=xxxx, item2=xxxx, } add diffenent entry "xxx" { item1=xxx, item2=xxxx, item3=xxx add subitem { item1=xxx, item2=xxx, item3=xxx } add subitem { item1=xxx, item2=xxx, } } another type { item1=xxx } with "xxx" } BLOCK my @tokens = split /\s+/, $contents; my @chunks; push @chunks, ExtractChunks (\@tokens); Dump (\@chunks); sub ExtractChunks { my $tokens = shift; my @chunks; while (@$tokens) { my $token = shift @$tokens; if ($token =~ /{/) { push @chunks, ExtractChunks ($tokens); } elsif ($token =~ /}/) { last; } else { push @chunks, $token; } } return \@chunks; }

    Prints:

    $ARRAY1 = [ [ 'create', 'item', '"xxx"', [ 'remove', 'entry', '"xxx"', 'add', 'entry', '"yy"', '"xxx"', [ 'item1=xxxx,', 'item2=xxxx,' ], 'add', 'diffenent', 'entry', '"xxx"', [ 'item1=xxx,', 'item2=xxxx,', 'item3=xxx', 'add', 'subitem', [ 'item1=xxx,', 'item2=xxx,', 'item3=xxx' ], 'add', 'subitem', [ 'item1=xxx,', 'item2=xxx,' ] ], 'another', 'type', [ 'item1=xxx' ], 'with', '"xxx"' ] ] ];

    DWIM is Perl's answer to Gödel

      Thanks for your helpful suggestions. I'll look over all the points of interest.

      Sorry I haven't replied sooner, been busy with a new release and other issues. Unfortunately, the above is an active script file used by a few heavy duty C++ apps. I was asked to help improve the auditing system that generates reports from these script files. (it enables users to provide adjustments to the current data run)

      To get the script working, it currently uses the old match and set a flag :) to update a report. I implemented the updated code, which works descent - generates a more detailed spreadsheet capturing changes made in the last 5 years or so. :) It was one of those, when we get a chance... unfortunately, it took some auditors to get the ball rolling. All is well.

      The reason I asked, I know there has to be a better approach to the whole process. I would like to load it into a hash and eliminate the nested if's. At least it doesn't look like a rat nest now.

      I will look over the faqs and other suggested items.

      Thanks Again, Regards, budman

Re: Parsing balanced parentheses
by rodion (Chaplain) on Jul 23, 2006 at 11:32 UTC
    The big problem you can run into with parsing balanced elements comes when one of your delimiters shows up within the quotes. What happens if one of the "xxx" items contains "the enclosing '{}'", or even, "the closing '}'". If you can be sure this will never be in the data, then the task is reasonably simple, whether you write your parser or learn how to build a grammer for one of the parsing modules.

    If there is any chance the data will contain your delimiter, you should go to Text::Balanced, and check the section with the example

    $text = '<A HREF=">>>>">link</A>'; into ( '<A HREF=">>>>">', 'link</A>', "" )
    That module is also worth bing familiar with because the tasks it handles come up a lot in contexts where you don't want to switch over to thinking about a recursive grammar.
Re: Parsing balanced parentheses
by planetscape (Chancellor) on Jul 23, 2006 at 05:09 UTC
Re: Parsing balanced parentheses
by eyepopslikeamosquito (Archbishop) on Jul 23, 2006 at 12:11 UTC

    Three entries from perlfaq seem to be relevant:

    • perlfaq4: How do I find matching/nesting anything?
    • perlfaq6: Can I use Perl regular expressions to match balanced text?
    • perlfaq6: How do I use a regular expression to strip C style comments from a file?

      The faqs are good, but I think the "How do I find matching/nesting anything?" in perlfaq4 should have something about handling the case when the delimiter may be present in a quoted string inside a matching set of delimiters. perlfaq4 mentions Text::Balanced, which talks about this case in the docs, but there's a good chance you will go use another module before you read about it, since it's 8 pages in.

      The case is a nasty little gotcha for applications like the OPs because it ususally doesn't show up on testing; the bug often waits in production code until you've forgotten what you wrote and how it works, then it blows up with a confusing error message.

Re: Parsing balanced parentheses
by fmerges (Chaplain) on Jul 23, 2006 at 10:44 UTC

    Hi,

    There is also Parse::Yapp.

    Regards,

    fmerges at irc.freenode.net
Re: Parsing balanced parentheses
by I0 (Priest) on Jul 24, 2006 at 04:53 UTC
    use strict; use warnings; use Data::Dumper; my $contents = qq( create item "xxx" { remove entry "}x{" add entry "yy" "xxx" { item1=xxxx, item2=xxxx, } add diffenent entry "}}xxx" { item1=xxx, item2=xxxx, item3=xxx add subitem { item1=xxx, item2=xxx, item3=xxx } add subitem { item1=xxx, item2=xxx, } } another type { item1=xxx } with "{{xxx" } ); (my $re=$contents)=~s/(({)|(})|"(\\.|[^"\\])*"|.)/@{[')','']}[!$3]\Q$1 +\E@{['(','']}[!$2]/gs; print Dumper(@$=eval{$contents=~/$re/}); die $@ if $@; $re=join"|",map{"\Q$_\E"}@$; print Dumper extract($contents); sub extract{ my $c=shift; my @r; while( $c =~ /(\S.*?)\s*(?:{($re)} *(.*)|$)/mg ){ my($h,$b,$t)=($1,$2,$3); push @r,$b?[$h,extract($b),$t||()]:$h; } return @r?[@r]:$c; }

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: perlquestion [id://563072]
Approved by GrandFather
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others perusing the Monastery: (2)
As of 2024-04-16 14:53 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found