Beefy Boxes and Bandwidth Generously Provided by pair Networks
The stupid question is the question not asked
 
PerlMonks  

Reversible parsing (with Parse::RecDescent?)

by goibhniu (Hermit)
on Mar 14, 2008 at 19:55 UTC ( [id://674283]=perlquestion: print w/replies, xml ) Need Help??

goibhniu has asked for the wisdom of the Perl Monks concerning the following question:

My concept is to wrap some configuration management command lines in a Perl framework to expose the command options along with other configuration data in a standard configuration file. For the design, I've "drawn" these boxes in my head (and now in ASCII art):

/--------------\ /------------\ /-----------\ /------------\ / +--------\ | | | | | | | | | + | | command line |-->| Parse:: |-->| Data |-->| Data:: |-->| + XML | | string |<--| RecDescent |<--| structure |<--| Serializer |<--| + FILE | | | | | | | | | | + | /--------------\ /------------\ /-----------\ /------------\ / +--------\ ^ | /------------\ | | | command | | line | | grammar | | | /------------\

In the proof-of-concept stage, I've had some successes and some setbacks. PerlMonks++ to Niel Neely, the maintainer of Data::Serializer, for responding VERY quickly to a bug. Data::Serializer 0.44 includes this patch and lets me use XML::Dumper in raw mode with store and retrieve. Niel was very responsive and the patch is working great.

On the Parse::RecDescent side, I read the tutorial and wrote a simple grammar, passed alot of command line strings through it successfully to get the data structure to serialize, serialized it and deserialized it. I got stuck when I realized that Parse::RecDescent::Deparse didn't deparse my data structure into a command line string, but instead deparsed my parser into a grammar (should've rtfm before I got this far, I guess)!

Now I could write a function to take my datastructure and turn it into a command line string, but it seems a shame that I've gone to the trouble of writing a grammar to define how to go one direction, but can't use the same grammar to go the other direction. Especially if I intend to change and expand the grammar, I would like that to be the end of it and not have to also change the deparsing function.

So, now I have some questions:
Academic questions:

  • Is it necessarily possible to reverse the data through the grammar to get the command line?
  • or is this only possible for special grammars?
  • Is there something I need to in my grammar to make this possible?
Design decisions:
  • Is there a module that already reverses the parsing of Parse::RecDescent parsers?
  • Is there another module that makes this bi-directionality easier (I'd prefer not to learn another grammar specification syntax, but could)?
  • Is there a module that already does all this soup-to-nuts?

And finally, the infamous "what I've already done". This is grammarharness.pl:

#/usr/bin/perl -Wl use strict; use warnings; use Parse::RecDescent; use Data::Dumper; use Data::Serializer; my $grammar; my $grammarfile = $ARGV[0]; open my $gf,'<',$grammarfile or die 'bad grammarfile'; while (<$gf>) { $grammar.=$_; } close $gf; print '$grammar is ',"\n",$grammar; $::RD_HINT++; #$::RD_TRACE++; my $parser = new Parse::RecDescent ($grammar) or die 'bad grammar'; my $stringtoparsefile = $ARGV[1]; open my $sf,'<',$stringtoparsefile or die 'bad stringstoparse file'; my $result; my $reresult; #Data::Serializer 0.44 supports raw=> my $serializer = Data::Serializer->new(raw=>'1',serializer => 'XML::Du +mper'); my $outfilebasename = 'data'; my $outfileext = 'out'; my $outfileindex = 10; while (<$sf>) { print 'parsing $_: ',$_; if (defined $parser->startrule($_)){ $result = $parser->startrule($_); print Dumper($result); $serializer->store( $result, $outfilebasename.$outfileindex.'.'.$outfileext, '>>' ); $reresult = $serializer->retrieve( $outfilebasename.$outfileindex++.'.'.$outfileext, ); print Dumper($reresult); }else{ print " badstring\n"} }

This is packagegrammar.txt:
startrule: commandline{$item{commandline}} commandline: command options {[$item {command},$item{options}]} command: packagecommand {$item{packagecommand}} packagecommand: /(hap)/ | /(hccmrg)/ | /(hup)/ | /(hspp)/ | /(hpp)/ | +/(hpg)/ | /(hdp)/ | /(hdlp)/ | /(hcp)/ options: option(s?) option: optionflag optionvalue { [$item{optionflag}, $item{optionvalue +}?$item{optionvalue}:1] } optionflag: /(-\w+)/ { $1 } optionvalue: /(\w*)/ { $1 }

This is packagecommands.txt:
hap -b hap -b sknxharvest01 hap -b sknxharvest01 -enc testfile.dfo hap -b sknxharvest01 -usr cgowing -pass chaspass hap -badflag hap -prompt hap -b sknxharvest01 -prompt hap -prompt -b sknxharvest01

how it works (nothing particularly wrong here; it mostly proves all the working parts without attempting the deparse):
C:\chas_sandbox\grammars>grammarharness.pl packagegrammar.txt packagec +ommands.tx t $grammar is startrule: commandline{$item{commandline}} commandline: command options {[$item {command},$item{options}]} command: packagecommand {$item{packagecommand}} packagecommand: /(hap)/ | /(hccmrg)/ | /(hup)/ | /(hspp)/ | /(hpp)/ | +/(hpg)/ | /(hdp)/ | /(hdlp)/ | /(hcp)/ options: option(s?) option: optionflag optionvalue { [$item{optionflag}, $item{optionvalue +}?$item{op tionvalue}:1] } optionflag: /(-\w+)/ { $1 } optionvalue: /(\w*)/ { $1 } parsing $_: hap -b $VAR1 = [ 'hap', [ [ '-b', 1 ] ] ]; $VAR1 = [ 'hap', [ [ '-b', '1' ] ] ]; parsing $_: hap -b sknxharvest01 $VAR1 = [ 'hap', [ [ '-b', 'sknxharvest01' ] ] ]; $VAR1 = [ 'hap', [ [ '-b', 'sknxharvest01' ] ] ]; parsing $_: hap -b sknxharvest01 -enc testfile.dfo $VAR1 = [ 'hap', [ [ '-b', 'sknxharvest01' ], [ '-enc', 'testfile' ] ] ]; $VAR1 = [ 'hap', [ [ '-b', 'sknxharvest01' ], [ '-enc', 'testfile' ] ] ]; parsing $_: hap -b sknxharvest01 -usr cgowing -pass chaspass $VAR1 = [ 'hap', [ [ '-b', 'sknxharvest01' ], [ '-usr', 'cgowing' ], [ '-pass', 'chaspass' ] ] ]; $VAR1 = [ 'hap', [ [ '-b', 'sknxharvest01' ], [ '-usr', 'cgowing' ], [ '-pass', 'chaspass' ] ] ]; parsing $_: hap -badflag $VAR1 = [ 'hap', [ [ '-badflag', 1 ] ] ]; $VAR1 = [ 'hap', [ [ '-badflag', '1' ] ] ]; parsing $_: hap -prompt $VAR1 = [ 'hap', [ [ '-prompt', 1 ] ] ]; $VAR1 = [ 'hap', [ [ '-prompt', '1' ] ] ]; parsing $_: hap -b sknxharvest01 -prompt $VAR1 = [ 'hap', [ [ '-b', 'sknxharvest01' ], [ '-prompt', 1 ] ] ]; $VAR1 = [ 'hap', [ [ '-b', 'sknxharvest01' ], [ '-prompt', '1' ] ] ]; parsing $_: hap -prompt -b sknxharvest01 $VAR1 = [ 'hap', [ [ '-prompt', 1 ], [ '-b', 'sknxharvest01' ] ] ]; $VAR1 = [ 'hap', [ [ '-prompt', '1' ], [ '-b', 'sknxharvest01' ] ] ]; C:\chas_sandbox\grammars>type data10.xml <perldata> <arrayref memory_address="0xdee0b8"> <item key="0">hap</item> <item key="1"> <arrayref memory_address="0xd67300"> <item key="0"> <arrayref memory_address="0xdee28c"> <item key="0">-b</item> <item key="1">1</item> </arrayref> </item> </arrayref> </item> </arrayref> </perldata> C:\chas_sandbox\grammars>


#my sig used to say 'I humbly seek wisdom. '. Now it says:
use strict;
use warnings;
I humbly seek wisdom.

Replies are listed 'Best First'.
Re: Reversible parsing (with Parse::RecDescent?)
by ikegami (Patriarch) on Mar 14, 2008 at 22:18 UTC

    I haven't thought of this topic much before you brought it up and I haven't researched it, so I'm making it up as I go along. However, I do have experience with parsing and with P::RD specifically, so hopefully you'll find it useful.


    • Is it necessarily possible to reverse the data through the grammar to get the command line?
    • or is this only possible for special grammars?
    • Is there something I need to in my grammar to make this possible?

    Yes, it's possible to deparse a parse tree, if the parse tree contains sufficient information. That means that your parser must return sufficient information to regenerate an acceptably similar source. (e.g. If your language accepts single-quoted and double-quoted string literals, you may not care which one is used in the deparsed text as long as the resulting text is equivalent to the original. You probably don't care about whitespace and comments either.) With P::RD, that is controlled by the grammar.

    • Is there a module that already reverses the parsing of Parse::RecDescent parsers?
    • Is there another module that makes this bi-directionality easier (I'd prefer not to learn another grammar specification syntax, but could)?
    • Is there a module that already does all this soup-to-nuts?

    I'm afraid the grammar couldn't be the same as a the one used by P::RD. Actions ({ ... }) and many directives (<...>) cannot be automatically reversed. Error checks would be different. etc.

    Now, something like BNF could be automatically reversed, so you could write a parser and deparser that accepted pure BNF. The problems with this approach is that the product returned by the parser would be of limited use. It would require further processing. This is what the P::RD lumps in with the grammar, and that's what's not automatically reversible.


    Next, let's address some issues with the grammar.

    Bugs:

    • Use of captures is undocumented and unnecessary. Use "{ $item[1] }" instead of captures and "{ $1 }".
    • "/ident/" incorrectly matches the start of "identx".
    • You don't check if the end of input was reached, so trailing errors in your input can be ignored. In fact, it happens in one of your test cases: ".dfo" is silently ignored in "hap -b sknxharvest01 -enc testfile.dfo". Note the "/\Z/" I added.
    • A common mistake is to assume the strict and warnings settings from the caller are used for the parser. You need to include these in the code block in the grammar if you want them.

    Sketchy Code:

    • No reason to compile the grammar every time.
    • "rule : subrule { $item[1] }" is the same as "rule : subrule".
    • Each rule invoked adds a fair bit of time to the overhead when using P::RD, so I inlined the trivial ones.
    • Something that's optional should be made optional by a rule, not by matching a zero-length string. (Re: optvalue)

    Stylistic:

    • I renamed "startrule" to "parse" since I find "$parser->parse($text)" appealing.
    • Shortened "command", "pacakge" and "option" to their well known abbreviations: "cmd", "pkg" and "opt".
    • I seperated token rules from other rules and uppercased them. P::RD doesn't distinguish between tokens and higher-level rules, but there tends to be subtle differences between them. For example, I tend to avoid including $item[0] in the returned value for tokens, while I tend to include it for other rules.
    • I align : and |. I find this helps readability a lot.
    # build_parser.pl use strict; use warnings; use Parse::RecDescent qw( ); my $grammar = <<'__END_OF_GRAMMAR__'; { # Lexical pragmas and lexcial variables # here are in the same scope as the sub # created for each grammar rule. use strict; use warnings; my %KNOWN_COMMANDS = map { $_ => 1 } qw( hap hccmrg hup hspp hpp hpg hdp hdlp hcp ); } parse : cmdline /\Z/ { $item[1] } cmdline : pkgcmd opt(s?) { [ @item[1,2] ] } pkgcmd : IDENT pkgcmd_[ $item[1] ] { $item[1] } pkgcmd_ : { $KNOWN_COMMANDS{$arg[0]} } | <error:Unknown command> opt : FLAG opt_ { [ @item[1,2] ] } opt_ : IDENT | { 1 } FLAG : /-\w+/ IDENT : /\w+/ __END_OF_GRAMMAR__ Parse::RecDescent->Precompile($grammar, 'Parser') or die("Bad Grammar\n");

    In this post, I show how to deparse the tree created by the above grammar. For more complex trees, it can be very useful to have each non-token rule returned a bless object. Then some method in the class would be responsible for deparsing itself instead of relying on Deparse to do it. I'll get to that in a separate post.

    A quick note before contining: You'll notice the parse tree doesn't differentiate between "-opt" and "-opt 1", so you'll get "-opt 1" when you deparse it. This may be acceptable, where it would be a case of being "acceptably similar" as mentioned above. If it's not acceptable, then it's a case of the parse tree not holding enough information as is.

    Basically, I created a function for each rule. The function resemble the associated rule, but deparses it instead. Note that your grammar is very trivial (has no choices in it really), which is why you didn't need to use $item[0] (or some other constants) anywhere, and you don't have to make any choices when deparsing.

    # Deparse.pm use strict; use warnings; package Deparser; my $skip = ' '; sub add_skip { if (length($_[0])) { return $skip . $_[0]; } else { return ''; } } sub loop(&@) { my $cb = shift(@_); return join $skip, map { $cb->() } @_; } sub deparse { &cmdline } sub cmdline { my ($node) = @_; my ($pkgcmd, $opts) = @$node; my $text = pkgcmd($pkgcmd); $text .= add_skip(loop { opt($_) } @$opts); return $text; } sub pkgcmd { my ($node) = @_; return IDENT($node); } sub opt { my ($node) = @_; my ($flag, $val) = @$node; my $text = FLAG($flag); $text .= add_skip(IDENT($val)); return $text; } sub FLAG { return $_[0] }; sub IDENT { return $_[0] }; 1;

    As you can see, it's not too hard. Just make sure to reflect changes to your parser with changes to your deparser.

    Now let's use the code.

    # test.pl use strict; use warnings; use Data::Dumper qw( Dumper ); use Parser qw( ); use Deparser qw( ); { my $parser = Parser->new(); while (<DATA>) { chomp; print("Unparsed: $_\n"); my $tree = $parser->parse($_) or do { print("Bad data: $_\n"); next; }; my $dumper = Data::Dumper->new([ $tree ]); $dumper->Useqq(1); $dumper->Terse(1); $dumper->Indent(0); print("Parsed: ", $dumper->Dump(), "\n"); my $deparsed = Deparser::deparse($tree); print("Deparsed: $deparsed\n"); } continue { print("\n"); } } __DATA__ hap -b hap -b sknxharvest01 hap -b sknxharvest01 -enc testfile.dfo hap -b sknxharvest01 -usr cgowing -pass chaspass hap -badflag hap -prompt hap -b sknxharvest01 -prompt hap -prompt -b sknxharvest01

    Results:

    >perl build_parser.pl >perl test.pl Unparsed: hap -b Parsed: ["hap",[["-b",1]]] Deparsed: hap -b 1 Unparsed: hap -b sknxharvest01 Parsed: ["hap",[["-b","sknxharvest01"]]] Deparsed: hap -b sknxharvest01 Unparsed: hap -b sknxharvest01 -enc testfile.dfo Bad data: hap -b sknxharvest01 -enc testfile.dfo Unparsed: hap -b sknxharvest01 -usr cgowing -pass chaspass Parsed: ["hap",[["-b","sknxharvest01"],["-usr","cgowing"],["-pass"," +chaspass"]]] Deparsed: hap -b sknxharvest01 -usr cgowing -pass chaspass Unparsed: hap -badflag Parsed: ["hap",[["-badflag",1]]] Deparsed: hap -badflag 1 Unparsed: hap -prompt Parsed: ["hap",[["-prompt",1]]] Deparsed: hap -prompt 1 Unparsed: hap -b sknxharvest01 -prompt Parsed: ["hap",[["-b","sknxharvest01"],["-prompt",1]]] Deparsed: hap -b sknxharvest01 -prompt 1 Unparsed: hap -prompt -b sknxharvest01 Parsed: ["hap",[["-prompt",1],["-b","sknxharvest01"]]] Deparsed: hap -prompt 1 -b sknxharvest01

    Time for a break! Blessed version later.

      First of all, thanks to Joost and Blockhead for your input above. It's a valuable, helpful discussion, but I think the bulk of my response is to ikegami, here.

      ikegami, your work is VERY much appreciated. I've always admired the elegance of your code, and this is no exception. I have learned alot about writing P::RD grammars by just skimming yours, above, and know that with study I can learn more.

      One thing I keep thinking both in my own development process on this project and in reading your (and Joost's and blockhead's) responses is that I'm definitely in a simplistic, proof-of-concept phase on this. You've pointed out that my grammar is currently trivial, and I think that is partly why. My concern is in the general case where I may have a much expanded and much less trivial grammar to maintain. At this point your tricks and style suggestions will go a long way toward helping with that; thank you. When I realized the error of my assumptions with P::RD::Deparse, I thought I would end up writing a function or functions to deparse it manually, and your examples present a good organization for doing so.

      The question that still nags me is whether its necessary. I think your point about "-opt" and "-opt 1" prooves that a grammar is not necessarily reversible (and, I think, provides a problematic example for the joost/blockhead thread, above). Two different strings parse to the same parse tree. I could try to fix these on a case-by-case basis, but wonder if there's a general strategy for writing a grammar to avoid them (especially since I'm free to change the data structure emitted by the grammar if I please), or a general property of a grammar that could detect them (possibly for use in a mythic P::RD::ReverseParse for detecting error conditions in the input grammar). <update> What I'd like to avoid is making changes in two places; on the grammar side and on the deparse function side. </update>

      In the general case, you propose that P::RD Actions and Directives along with error checking make it not possible. What would be an example case to illustrate the limitations of a BNF parser (and is there one already written on CPAN? the closest hit I found was Parse::Marpa)?

      <update index='2' date='17-Mar-2008'> I stumbled across Tree::Parser, which seems to take coderefs to a parse subroutine separate from a deparse subroutine. This is consistent with ikegami's solution and is another example that contradicts my laziness in wanting to maintain the parse / deparse rules in one place. Perhaps I should give up and just write the deparse versions. </update>


      #my sig used to say 'I humbly seek wisdom. '. Now it says:
      use strict;
      use warnings;
      I humbly seek wisdom.
Re: Reversible parsing (with Parse::RecDescent?)
by Joost (Canon) on Mar 14, 2008 at 22:08 UTC
    I'm only vaguely familiar with parse::recdescent, but I'll try to give some general hints.
    * Is it necessarily possible to reverse the data through the grammar to get the command line?
    No. At least, for most grammars there are multiple variations of code that will end up as the same syntax tree / data structure. IOW if the original input is unambiguous, you should be able to reverse it into something equivalent but not necessarily character-for-character the same. If the input is ambiguous you'll have more problems.

    * or is this only possible for special grammars?
    I think in theory it would be possible to re-generate equivalent source data for any parse tree, given unambiguous grammar and input, but I'm not sure that parse::recdescent makes it easy (or even possible) to do that given only the grammar and the output data.

    Looking at your code it doesn't seem *that* difficult to come up with something to handle your cases, though you may have to deal with some duplication of grammar rules.

      No. At least, for most grammars there are multiple variations of code that will end up as the same syntax tree / data structure. IOW if the original input is unambiguous, you should be able to reverse it into something equivalent but not necessarily character-for-character the same. If the input is ambiguous you'll have more problems.

      You would still be able to do the direction that the OP wants, even in an ambiguous grammar. Say the OP uses a parser to convert string to parse tree. There may be multiple parse trees for that input, but the parser will find one. That parse tree is unambiguous and refers to just one string, so you will be able to go back. Formally, the deparsing operation is always a left inverse of the (set of) parsing operation(s), but only a right inverse if the grammar is unambiguous.

      Of course, the above discussion is all in the world of theoretical context-free languages, where the parse tree contains every production that was applied. In real life, we don't parse strings, we parse a stream of tokens, and anything lost in tokenization doesn't make it into the parse tree. We also flatten/simplify parse trees on the fly, resolve syntactic sugar, and do various other shortcuts.. not to mention non-context-free things that P::RD can do. Anyway, if your parse tree contains the important syntactic structure, as ikegami says, you can always deparse it back to obtain at least a syntactically equivalent string to the original.

      blokhead

Re: Reversible parsing (with Parse::RecDescent?)
by cutlass2006 (Pilgrim) on Mar 16, 2008 at 08:21 UTC

    this is not a 'humpty dumpty' operation it is possible .... but I agree with other responses in that it will be difficult to have perfect 'round tripping'.

    btw (note: this is unrelated to your problem, just noticed you might be working with XML) have you considered using XML::Descent

      Thanks for your feedback and suggestions. I checked out XML::Descent; it looks like a descent enough module (pun intended), but deals on the right side of my ascii-art design diagram, which was solved well enough with Data::Serializer and XML::Dumper for now. As I go from proof-of-concept to reality, I'll definitely consider it, thanks.


      #my sig used to say 'I humbly seek wisdom. '. Now it says:
      use strict;
      use warnings;
      I humbly seek wisdom.
        I built a first iteration of an XSLT lint checker using this module ... in the end I decided to use some other technology but I found the module well built and the concept better then rolling your own in RecDescent when it comes to working with XML. hth, Jim Fuller

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: perlquestion [id://674283]
Approved by Corion
Front-paged by mr_mischief
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others meditating upon the Monastery: (8)
As of 2024-04-23 08:05 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found