http://qs321.pair.com?node_id=180778

Working Notes on Parse::RecDescent

Warning:

Since this isn't a tutorial, the following assumes that you are familiar with the module in question, at least enough that the things mentioned are not a complete mystery. Also be aware that these are answers I found along the way: what was obvious to me may not be obvious to you, and certainly vice versa! I'm writing this down because had I known any of it earlier, I would have saved time and effort; your mileage may well vary. Last caveat: any or all of this could be suspect. Better folk than I may well have better ways of dealing with any of it (speak up if this is you...). All I know is that it works for me! (Got that line from tech support...)

Random Tips

  1. List alternates on separate lines. That means, not this:
    name: 'match' | 'name' | 'mode' | 'priority'
    but this:
    name: 'match' |
          'name' |
          'mode' |
          'priority'
  2. Precede alternates with '|' rather than follow. Because eventually what you really want is:
    name: 'match'    { [$item[0],$item[1]] }
        | 'name'     { [$item[0],$item[1]] }
        | 'mode'     { [$item[0],$item[1]] }
        | 'priority' { [$item[0],$item[1]] }
  3. Use sub AUTOLOAD. Due to a less than careful reading of the Camel book, I'd thought that this only worked in '.pm' files, i.e. as part of a package. But a little thinking reminded me that everything is a package, so this magical feature is available to anyone with the need. Here is what I use:
    sub AUTOLOAD {
        my $tree = shift;
        our $AUTOLOAD;
        print STDERR "@@@ $AUTOLOAD(\$tree) @@@\n";
        recurse($tree);
    }
    “But what's it good for?” you ask. Well, as the book points out, when a subroutine is undefined, AUTOLOAD is called with the same arguments that would have been passed to the original subroutine. This is useful in at least two cases: when you want to see what was sent to a given function, and when you haven't written the missing code yet. As an example, during the code and test cycle I'll make a change to some portion of the parse tree. If it is an addition, I'll hold off on writing the necessary code until I can confirm my 'expectations' about what the new code will be sent. If on the other hand a change causes a problem, then it is easy enough to rename the affected functions (I just prepend an underbar to the function name) so that AUTOLOAD catches the calls, and then analyze the resulting display of information. It is not a particularly sophisticated technique (a semi-modern version of the old IBM core dump), but I use it because it works, not because it's trendy!
  4. Use:
    • use strict;
    • no strict "refs";
    • use warnings;
    • use diagnostics;
    • use Parse::RecDescent;
    • use Data::Dumper;
    • use Carp::Assert;
    • use Carp;
    The first few are obvious. The last three provide a much needed tool kit as you work your way through your project. Given that the returned result of $data = $parser->startrule($test); is a complex data structure (a reference that refers to an array of arrays with literals along the way), Data::Dumper is your friend! Carp::Assert allows you to test your assumptions as you code, and Carp allows you to identify just how you got to where you crashed! Think defensively, act preemptively (old football slogan...)! Note for those who immediately criticize the less than complete use strict;, too bad...my code uses symbolic code refs, something I won't do without. (see sub startrule below)
  5. Use print Dumper $whatever,"\n"; as needed. Just because you are lost in a maze of twisty little passages, doesn't mean you can't get a decent road map.
  6. In addition to Dumper, don't hesitate to roll your own tree walkers. Code reuse is good, but sometimes it isn't what you need. Thing to remember here is that the parser returns a parse tree, nothing too complicated, either the 'current thing' is an array or it's not. If not then do something with the data, otherwise move to the next level and repeat the process.
  7. It is nice to say “Use the source Luke!” but the truth is that Parse::RecDescent is pretty opaque. You are probably better off using the tried and true print statement instead.
  8. Use $::RD_AUTOACTION = q { [@item[0..$#item]] }; until you know what you are doing and then replace each default action with one tailored to your needs.
  9. $item[0] is your friend, keep it around. A crude version of compiling is just two steps: build the parse tree and then execute the tree! Think of it this way: with almost no effort at all, the first item in an array in the tree is the rule name, so why not also think of it as the name of the function that will process the tree at that point? Here is my 'standard' sub startrule; the commented line is the one that depends on $item[0]:
    sub startrule {
        my $tree = shift;
        foreach (@$tree) {
            if ( ref($_) eq 'ARRAY' ) {
                if ( ref( $_->[0] ) eq 'ARRAY' ) {
                    startrule($_);
                }
                else {
                    &{ $_->[0] }($_);    # depends on $item[0]: calls the sub named after the rule
                }
            }
        }
    }
    Like I said, keep it around, it's useful!
  10. Place trial input in the __DATA__ section, this allows regression testing, so use discretion when weeding this out.
  11. Refactor constantly-- often portions of the grammar will become obsolete and will need to be pruned. Because this is the case, the previous item becomes even more important. Even if you have a fairly complete design before you commit to code, the process of building and testing will suggest changes, and changes will result in a certain amount of obsolescence-- hence the need for pruning shears!
  12. Replace all regex character sets with a symbolic reference. Well that's not quite correct, replace all but the last instance! In other words, things like:
    style_option: 'version'                    { [@item[0..$#item]] }
                | 'xmlns:xsl'                  { [@item[0..$#item]] }
                | 'id'                         { [@item[0..$#item]] }
                | 'extension-element-prefixes' { [@item[0..$#item]] }
                | 'exclude-result-prefixes'    { [@item[0..$#item]] }
                | /[a-zA-Z0-9:_.\-]+/          { [@item[0..$#item]] }
    Becomes:
    style_option: 'version'                    { [@item[0..$#item]] }
                | 'xmlns:xsl'                  { [@item[0..$#item]] }
                | 'id'                         { [@item[0..$#item]] }
                | 'extension-element-prefixes' { [@item[0..$#item]] }
                | 'exclude-result-prefixes'    { [@item[0..$#item]] }
                | char_set                     { [@item[0..$#item]] }
    char_set: /[a-zA-Z0-9:_.\-]+/ { $item[1] }
    Think of this as yet another version of the 'No magic numbers' rule. Besides it's likely that you will want to use 'char_set' elsewhere and having a single point of definition makes later changes manageable!
  13. New sym-refs need not disturb the parse tree, you can 'stealth' them in-- i.e. { $item[1] }, in place of { [$item[0],$item[1]] }. (See previous example.)
  14. Parse trees usually don't need literals, remove them when you can. For instance, if you have something like this:
    preserve_space: 'preserve-space' '[' 'elements' '=' qstring ']' paren(?)
    You will do better to have an action like this:
    { [$item[0],$item[5]] }
    As you can see, 'qstring' is the only significant bit here so the returned value for this rule is rule-name followed by rule-value.
  15. If you use the 'special magic' to create a grammar class from the command line using:
    > perl -MParse::RecDescent - grammar Yet::Another::Grammar
    be aware that this might not work the same if the original (presumably in-line) grammar depended on $::RD_AUTOACTION = q { [@item[0..$#item]] }; or similar. The magic method ignores RD_AUTOACTION and uses its own default action as needed. The solution is to duplicate the auto action by hand in the grammar file-- this is not such a big deal since by then most actions will have already been customized.
  16. There is still no free lunch! Eventually you will come to a point where you need to parse either multi-line comments or something similar. Be aware that just because everyone said to use Parse::RecDescent doesn't mean that the answer is easy. It's not, you still need to bite the bullet and do the work. You may at first think that <perl_quotelike> is the way out. Do not be deceived! When the documentation says

    Parse::RecDescent provides limited support for parsing subsets of Perl, namely: quote-like operators, Perl variables, and complete code blocks.
    it is being literal. If your language is not 'Perl' then this short cut will not get you to the church on time!

    Lest I be accused of talking around the problem, here is what I do to support multi-line comments:

    xcomment: <skip: qr/[ \t]*/> newline(0..) '<!--'
        {
            ($text,$return) = main::parse_delimited($text,'<!--','-->');
            $return = ['xcomment',$return];
        }
    Where the function used looks like:
    #______________________________________________________________________________
    sub parse_delimited {
        my $text       = shift;
        my $startdelim = shift;
        my $enddelim   = shift;
        my $mc = new Text::DelimMatch( $startdelim, $enddelim );
        my ( $p, $m, $r ) = $mc->match( $startdelim . $text );
        if ($p) {
            $text = $p;
        }
        else {
            $text = "";
        }
        $text .= $r if ($r);
        $m =~ s/^$startdelim//;
        $m =~ s/$enddelim$//;
        return $text, $m;
    }
    #______________________________________________________________________________
    It is not a perfect solution, as the documentation says,
    Modifying the value of the variable $text may confuse the column counting mechanism
    but other than that it does have the virtue of 'working'!
  17. Greediest production in a rule goes last. For instance given the following:
    startrule: is_printable(s) {[$item[0],$item[1]]}
             | name            {[$item[0],$item[1]]}
    is_printable: <skip: ''> /[[:print:]]+/ { [$item[0],$item[2]] }
    name: 'match'    { [@item[0..$#item]] }
        | 'name'     { [@item[0..$#item]] }
        | 'mode'     { [@item[0..$#item]] }
        | 'priority' { $item[1] }
    You are never going to get to 'name', because 'is_printable' will consume all of the characters in any given 'name', do not pass Go, do not etc. Further, correct ordering from least greedy to most, allows the last sub-rule to act as a backstop for the rule in general.
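
    The roll-your-own tree walker from tip 6 can be sketched roughly as follows. This is a minimal illustration, not the code from my project: the tree is built by hand as a stand-in for real parser output, and the sub name walk and the '(list)' label are made up for the example. It assumes nodes shaped like [ rule_name, children... ], which is what an auto-action of [@item] tends to give you.

```perl
use strict;
use warnings;

# Hand-built stand-in for a parse tree (an assumption for this sketch):
# each node is [ rule_name, children... ], leaves are plain strings.
my $tree = [ 'startrule',
    [ 'name', 'match' ],
    [ 'preserve_space', [ 'qstring', 'foo bar' ] ],
];

# Walk the tree depth-first and return one indented line per node or leaf.
sub walk {
    my ( $node, $depth ) = @_;
    $depth = 0 unless defined $depth;
    my $indent = '  ' x $depth;

    # Not an array: it is a leaf, so just report the data.
    return ("$indent'$node'") unless ref $node eq 'ARRAY';

    # An array: if the first element is a plain string, treat it as the
    # rule name; otherwise (e.g. a list of sub-trees) use a generic label.
    my @children = @$node;
    my $label = ( @children && !ref $children[0] ) ? shift @children : '(list)';

    return ( "$indent$label", map { walk( $_, $depth + 1 ) } @children );
}

my @lines = walk($tree);
print "$_\n" for @lines;
```

    Returning the lines instead of printing them inside the walker keeps it easy to reuse, for instance to diff two trees after a grammar change.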

–hsm

"Never try to teach a pig to sing…it wastes your time and it annoys the pig."

Re: Random Tips on Parse::RecDescent
by Aristotle (Chancellor) on Jul 11, 2002 at 23:01 UTC
    Useful compilation, but you can go back and reenable strict 'refs'.
    #!/usr/bin/perl -w
    use strict;
    for (qw(foo bar)) {
        ($main::{$_} or sub { print "No such sub: $_\n" })->();
    }
    sub foo { print "Yes, I'm here.\n"; }
    __END__
    Yes, I'm here.
    No such sub: bar
    Alternatively:
    (UNIVERSAL::can('main', $_) or sub { print "No such sub: $_\n" })->();
    Makeshifts last the longest.
      Ok, I'll bite...
      #!/perl/bin/perl
      #
      # coderefs.pl -- Yes you can use strict...
      use strict;
      use warnings;
      use diagnostics;
      for (qw(foo bar)) {
          ($main::{$_} or sub { print "No such sub: $_\n" })->();
      }
      $_ = 'baz';
      $main::{$_}();    # works! but,
      &{$_}();          # doesn't...
      sub foo { print "Yes, I'm here.\n"; }
      sub baz { print "I'm here as well.\n"; }
      __END__
      So my question is why?

      –hsm

      "Never try to teach a pig to sing…it wastes your time and it annoys the pig."

        Because %PACKAGE:: is a special hashtable containing all global symbols from the given PACKAGE. Via that hash, a hard reference to the desired subroutine can be looked up. The call to UNIVERSAL::can produces the same result. And obviously calling a subroutine by dereferencing a hard reference is allowed under the stricture.

        Try the following sometime:

        $ perl -MData::Dumper -e'print Dumper(\%main::);'

        It's rather interesting to poke around in there.

        Basically, a lot of things in Perl (all of OO really, f.ex) are symbolic lookups, so they cannot be evil by nature. What is evil is accidentally using symbolic lookups where you meant a hard dereference. If your code has a bug so that it happens to put a string rather than the hard reference you intended into $var, without the stricture $$var will still work but suddenly becomes a soft reference. If that only happens sporadically, the resulting bugs can be incredibly hard to spot. That's what strict 'refs' catches, and that's why I strongly suggest you reenable it. You're depriving yourself of a very important safety net otherwise.

        Conversely, when you really do need a symbolic lookup, you can still achieve it in ways strict won't complain about. It's just that you explicitly spell out that you do in fact want a symbolic lookup and are fully aware that it's happening.
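
        Applied to the rule-name dispatch in the root node, that could look something like the sketch below. It is only an illustration under assumptions: the handler subs, the dispatch sub and the @log array are hypothetical names invented here, and the nodes are hand-built [ rule_name, value ] pairs rather than real parser output. The lookup goes through can(), which hands back a hard code reference or undef, so strict 'refs' stays on.

```perl
use strict;
use warnings;

my @log;    # records what each handler saw, for illustration only

# Hypothetical handlers named after grammar rules.
sub name     { my $node = shift; push @log, "name: $node->[1]" }
sub char_set { my $node = shift; push @log, "char_set: $node->[1]" }

# Look the handler up by rule name. can() returns a real code reference
# or undef, so no symbolic dereference is involved.
sub dispatch {
    my $node    = shift;
    my $handler = __PACKAGE__->can( $node->[0] );
    unless ($handler) {
        push @log, "no handler for $node->[0]";
        return;
    }
    return $handler->($node);
}

dispatch($_) for [ name => 'match' ], [ char_set => 'abc' ], [ bogus => 1 ];
print "$_\n" for @log;
```

        One caveat with this shortcut: can() also finds anything else visible in the package, including imported subs and the UNIVERSAL methods, so a rule that happens to be called 'can' or 'isa' would resolve to the wrong thing. Keeping the handlers in a package of their own, or in an explicit hash of code refs, avoids that.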

        Makeshifts last the longest.

Re: Random Tips on Parse::RecDescent
by educated_foo (Vicar) on Jul 11, 2002 at 15:49 UTC
    Excellent node! I have a couple of questions:
    1. Why do you do [@item[0..$#item]] rather than just [@item]? IIRC, the second will copy the array as well, so there's no need to take a slice.
    2. re: number 15 -- I've noticed that it complains about RD_AUTOACTION when you use Precompile (which is the same as the command-line?), but it seems to still append auto-actions just fine. Is this the behavior you saw?
    And finally, one additional suggestion to add: Do as little as possible in your actions, particularly in the ones in lower-level rules. Building up and tearing down parse trees can be awfully slow, so it may be faster to just do the fastest thing possible during the parse, and post-process it afterwards.

    /s

      About the [@item[0..$#item]] thing...no reason other than using the supplied example from the docs (after all, good enough for Damian, good enough for me). chromatic caught this when I posted, I checked and as you say, there is no difference other than less typing!
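
      For anyone who wants to check this without a grammar in hand, here is a small sketch in plain Perl. The @item array below is just a stand-in for the one Parse::RecDescent provides inside an action, and the variable names are made up for the example.

```perl
use strict;
use warnings;

# Stand-in for the @item array available inside an action (an assumption
# for this sketch; no parser is involved).
my @item = ( 'name', 'match' );

my $via_slice = [ @item[ 0 .. $#item ] ];
my $via_copy  = [ @item ];

# Both are new anonymous arrays holding the same elements...
print "same contents\n" if "@$via_slice" eq "@$via_copy";

# ...and neither one aliases @item, so later changes do not leak in.
$item[1] = 'changed';
print "slice still holds: $via_slice->[1]\n";
print "copy still holds:  $via_copy->[1]\n";
```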

      As far as the #15 goes, the problem is that the grammar file has no way to contain the RD_AUTOACTION specification with the command line as given. I also played with the expanded .pl version but still couldn't get it to provide identical behavior. What did work was the suggested sledge hammer approach--fix each instance by hand! There may have been a discernable pattern to what was used as default, but I didn't see it and didn't have the time to track it down.

      I like your final 'additional suggestion', the Parse::RecDescent version of 'do as little as possible to get the job done'.

      –hsm

      "Never try to teach a pig to sing…it wastes your time and it annoys the pig."
Re: Random Tips on Parse::RecDescent
by herveus (Prior) on Jul 11, 2002 at 15:01 UTC
    Howdy!

    Thanks for collecting and sharing your thoughts on this. I tried once to use Parse::RecDescent as an exercise (the source file was not hard to parse via simple regexen) and got bitten on the ass by something I never figured out.

    This might inspire me to go back and try again...

    yours,
    Michael

Re: Random Tips on Parse::RecDescent
by davistar (Novice) on Feb 02, 2006 at 19:39 UTC
    hsm,
    Thanks for the excellent tips on PRD! I'm trying to use tip #16 for C-like multiline comments '/*' and '*/'.
    I don't quite understand your syntax for the following production:
    xcomment: <skip: qr/[ \t]*/> newline(0..) '<!--'
        {
            ($text,$return) = main::parse_delimited($text,'<!--','-->');
            $return = ['xcomment',$return];
        }
    I translated your xcomment production into the following production:
    comment: <skip: qr/[ \t]*/> newline(0..) '/*'
        {
            ($text,$return) = main::parse_delimited($text,'/*','*/');
            $return = ['comment',$return];
        }
    My question is that I don't understand the newline(0..) part; I must be missing something obvious. I had errors if I used newline(0..) and main::parse_delimited, so I removed newline(0..) and changed main:: to $thisparser->, and ran the following test. It gives no compile errors, but it fails to match the comment production. What am I missing? Any help debugging would be appreciated!
    #!/usr/bin/perl -w
    use strict;
    use Parse::RecDescent;
    use Text::DelimMatch;
    $::RD_ERRORS = 1;
    $::RD_WARN   = 1;
    $::RD_HINT   = 1;
    $::RD_TRACE  = 1;
    my $grammar = q{
        {
            sub parse_delimited {
                my $text       = shift;
                my $startdelim = shift;
                my $enddelim   = shift;
                my $mc = new Text::DelimMatch( $startdelim, $enddelim );
                my ( $p, $m, $r ) = $mc->match( $text );
                if ($p) {
                    $text = $p;
                }
                else {
                    $text = "";
                }
                $text .= $r if ($r);
                $m =~ s/^$startdelim//;
                $m =~ s/$enddelim$//;
                return $text, $m;
            }
        }
        file: line(s) eofile
            { use Data::Dumper 'Dumper'; print Dumper @item }
        line: comment
            | <error>
        eofile: /^\Z/
        comment: <skip: qr/[ \t]*/> '/*'
            {
                ($text,$return) = $thisparser->parse_delimited($text, '/*', '*/');
                $return = ['comment',$return];
            }
    };
    my $parser = new Parse::RecDescent($grammar) or die "Bad grammar!\n";
    while (<DATA>) {
        chomp;
        print "$_...\n";
        defined($parser->file($_)) or print "Bad text!\n";
    }
    __DATA__
    /*Hello World */
    I also get the following errors in the trace which are disconcerting but I don't think it's causing the problem:
    Argument "/*" isn't numeric in addition (+) at C:/Perl/site/lib/Parse/RecDescent.pm line 2783, <DATA> line 1.
    Use of uninitialized value in substitution (s///) at (eval 15)[C:/Perl/site/lib/Parse/RecDescent.pm:2618] line 22, <DATA> line 1.
    Use of uninitialized value in concatenation (.) or string at (eval 15)[C:/Perl/site/lib/Parse/RecDescent.pm:2618] line 23, <DATA> line 1.
    Use of uninitialized value in substitution (s///) at (eval 15)[C:/Perl/site/lib/Parse/RecDescent.pm:2618] line 23, <DATA> line 1.
    Use of uninitialized value in substitution (s///) at (eval 15)[C:/Perl/site/lib/Parse/RecDescent.pm:2618] line 23, <DATA> line 1.

      The example in random tip #16 does not explicitly spell out that you need to :

      use Text::DelimMatch;

      and in the grammar definition you need a rule :

      newline: "\n"

      With those two things in place the code works fine for parsing multi line HTML comments.

      Parsing multi line C style comments is complicated by the fact that * is a regexp metacharacter, so it needs escaping. I managed to get the following code to work OK based on the technique outlined in the tip. I'm sure it could be done better, but I was struggling with the escaping.

      # Function to cope with multiline comments
      # Must be placed in main section of program
      sub parse_multilinecomment {
          my $text = shift;
          my $mc = new Text::DelimMatch( '\\/\\*', '\\*\\/' );
          my ( $p, $m, $r ) = $mc->match( '/*' . $text );
          if ($p) {
              $text = $p;
          }
          else {
              $text = "";
          }
          $text .= $r if ($r);
          $m =~ s/^\/\*//;
          $m =~ s/\*\/$//;
          return $text, $m;
      }

      and the grammar rules :

      newline: "\n"
      multilinecomment: <skip: qr/[ \t]*/> newline(0..) '/*'
          {
              ($text,$return) = main::parse_multilinecomment($text);
              print $return . "\n";
              $return = ['xcomment',$return];
          }

      Successfully matches the following example :

      /* A multiple line /* with nested */ comment */
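
      On the escaping struggle: Perl's built-in quotemeta (or \Q...\E inside a pattern) will do the backslashing for you, which might let the generic delimiter-passing version from the tip work with '/*' and '*/' as well. The sketch below only shows the delimiter-stripping half with plain regexes; it does not involve Text::DelimMatch, and whether the escaped strings can be handed to its constructor unchanged is an assumption I have not checked. The variable names are made up for the example.

```perl
use strict;
use warnings;

my ( $start, $end ) = ( '/*', '*/' );

# quotemeta backslashes every regex metacharacter, so '*' and '/' are
# matched literally when the strings are interpolated into a pattern.
my ( $start_re, $end_re ) = map { quotemeta($_) } $start, $end;

my $matched = '/* a comment */';

# Strip the delimiters from the matched text, as parse_delimited does.
( my $body = $matched ) =~ s/^$start_re//;
$body =~ s/$end_re$//;

print "[$body]\n";
```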

      Hope this may help someone

      Adrian

      I'll take a look as soon as I blow the dust off of the section of my brain that once retained this stuff!! Please do not hold your breath while I am doing so, but know that I am in fact checking it out...

      --hsm

      "Never try to teach a pig to sing...it wastes your time and it annoys the pig."
        hsm,
        I realize this is an old post and I won't hold my breath ;^)
        Please don't spend too much time on it I just was hoping it might ring some bells. I will also continue to investigate.
        Thanks!