Parsing balanced parentheses

budman has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
Re: Parsing balanced parentheses by GrandFather (Saint) on Jul 23, 2006 at 04:05 UTC
Seems there are a couple of ways of going about this depending on how much of the whole parsing problem this is. One way is to hand roll a recursive decent parser. Another way is to use Parse::RecDescent to generate the parser for you. If the data you have shown is representative of the worst case problem then hand rolling the parser is probably a reasonable approach. If it gets much more complex than that then using the module to generate a parser is probably a good idea. You might like to tell us something of the larger picture in case that hints at better ways to solve the larger problem rather than the fine focus on the parsing problem. The best approach to solving the parsing issue may well depend on what you want to do with the parsed result. Update: as an exercise I wrote the following code that may be a starting point for rolling parser: use strict; use warnings; use Data::Dump::Streamer; my $contents = <<'BLOCK'; create item "xxx" { remove entry "xxx" add entry "yy" "xxx" { item1=xxxx, item2=xxxx, } add diffenent entry "xxx" { item1=xxx, item2=xxxx, item3=xxx add subitem { item1=xxx, item2=xxx, item3=xxx } add subitem { item1=xxx, item2=xxx, } } another type { item1=xxx } with "xxx" } BLOCK my @tokens = split /\s+/, $contents; my @chunks; push @chunks, ExtractChunks (\@tokens); Dump (\@chunks); sub ExtractChunks { my $tokens = shift; my @chunks; while (@$tokens) { my $token = shift @$tokens; if ($token =~ /{/) { push @chunks, ExtractChunks ($tokens); } elsif ($token =~ /}/) { last; } else { push @chunks, $token; } } return \@chunks; } [download] Prints: `$ARRAY1 = [ [ 'create', 'item', '"xxx"', [ 'remove', 'entry', '"xxx"', 'add', 'entry', '"yy"', '"xxx"', [ 'item1=xxxx,', 'item2=xxxx,' ], 'add', 'diffenent', 'entry', '"xxx"', [ 'item1=xxx,', 'item2=xxxx,', 'item3=xxx', 'add', 'subitem', [ 'item1=xxx,', 'item2=xxx,', 'item3=xxx' ], 'add', 'subitem', [ 'item1=xxx,', 'item2=xxx,' ] ], 'another', 'type', [ 'item1=xxx' ], 'with', '"xxx"' ] ] ];` [download] DWIM is Perl's answer to Gödel	[reply] [d/l] [select]
Re^2: Parsing balanced parentheses by budman (Sexton) on Aug 03, 2006 at 07:28 UTC
Thanks for your helpful suggestions. I'll look over all the points of interest. Sorry I haven't replied sooner, been busy with a new release and other issues. Unfortunately, the above is an active script file used by a few heavy duty C++ apps. I was asked to help improve the auditing system that generates reports from these script files. (it enables users to provide adjustments to the current data run) To get the script working, it currently uses the old match and set a flag :) to update a report. I implemented the updated code, which works descent - generates a more detailed spreadsheet capturing changes made in the last 5 years or so. :) It was one of those, when we get a chance... unfortunately, it took some auditors to get the ball rolling. All is well. The reason I asked, I know there has to be a better approach to the whole process. I would like to load it into a hash and eliminate the nested if's. At least it doesn't look like a rat nest now. I will look over the faqs and other suggested items. Thanks Again, Regards, budman	[reply]
Re: Parsing balanced parentheses by rodion (Chaplain) on Jul 23, 2006 at 11:32 UTC
The big problem you can run into with parsing balanced elements comes when one of your delimiters shows up within the quotes. What happens if one of the "xxx" items contains "the enclosing '{}'", or even, "the closing '}'". If you can be sure this will never be in the data, then the task is reasonably simple, whether you write your parser or learn how to build a grammer for one of the parsing modules. If there is any chance the data will contain your delimiter, you should go to Text::Balanced, and check the section with the example `$text = '<A HREF=">>>>">link</A>'; into ( '<A HREF=">>>>">', 'link</A>', "" )` [download] That module is also worth bing familiar with because the tasks it handles come up a lot in contexts where you don't want to switch over to thinking about a recursive grammar.	[reply] [d/l]
Re: Parsing balanced parentheses by planetscape (Chancellor) on Jul 23, 2006 at 05:09 UTC
I would recommend you take a look at Some Parse::RecDescent Tutorials - the links therein contain many examples. HTH, planetscape	[reply]
Re: Parsing balanced parentheses by eyepopslikeamosquito (Archbishop) on Jul 23, 2006 at 12:11 UTC
Three entries from perlfaq seem to be relevant: perlfaq4: How do I find matching/nesting anything? perlfaq6: Can I use Perl regular expressions to match balanced text? perlfaq6: How do I use a regular expression to strip C style comments from a file?	[reply]
Re^2: Parsing balanced parentheses by rodion (Chaplain) on Jul 23, 2006 at 17:10 UTC
The faqs are good, but I think the "How do I find matching/nesting anything?" in perlfaq4 should have something about handling the case when the delimiter may be present in a quoted string inside a matching set of delimiters. perlfaq4 mentions Text::Balanced, which talks about this case in the docs, but there's a good chance you will go use another module before you read about it, since it's 8 pages in. The case is a nasty little gotcha for applications like the OPs because it ususally doesn't show up on testing; the bug often waits in production code until you've forgotten what you wrote and how it works, then it blows up with a confusing error message.	[reply]
Re: Parsing balanced parentheses by fmerges (Chaplain) on Jul 23, 2006 at 10:44 UTC
Hi, There is also Parse::Yapp. Regards, fmerges at irc.freenode.net	[reply]
Re: Parsing balanced parentheses by I0 (Priest) on Jul 24, 2006 at 04:53 UTC
use strict; use warnings; use Data::Dumper; my $contents = qq( create item "xxx" { remove entry "}x{" add entry "yy" "xxx" { item1=xxxx, item2=xxxx, } add diffenent entry "}}xxx" { item1=xxx, item2=xxxx, item3=xxx add subitem { item1=xxx, item2=xxx, item3=xxx } add subitem { item1=xxx, item2=xxx, } } another type { item1=xxx } with "{{xxx" } ); (my $re=$contents)=~s/(({)\|(})\|"(\\.\|[^"\\])"\|.)/@{[')','']}[!$3]\Q$1 +\E@{['(','']}[!$2]/gs; print Dumper(@$=eval{$contents=~/$re/}); die $@ if $@; $re=join"\|",map{"\Q$_\E"}@$; print Dumper extract($contents); sub extract{ my $c=shift; my @r; while( $c =~ /(\S.?)\s(?:{($re)} (.*)\|$)/mg ){ my($h,$b,$t)=($1,$2,$3); push @r,$b?[$h,extract($b),$t\|\|()]:$h; } return @r?[@r]:$c; } [download]	[reply] [d/l]


Just another Perl shrine
	PerlMonks