http://qs321.pair.com?node_id=180778

Working Notes on Parse::RecDescent

Warning:

Since this isn't a tutorial, the following assumes that you are familiar with the module in question, at least enough that the things mentioned are not a complete mystery. Another thing to be aware of is that these are answers that I found along the way-- what may have been obvious to me may not be to you. And certainly vice versa! I'm writing this down because had I known any of this, I would have saved time and effort-- your mileage may well vary. Last caveat: any or all of this could be suspect, and better folk than I may well have better ways of dealing with it (speak up if this is you...). All I know is that it works for me! (Got that line from tech support...)

Random Tips

  1. List alternates on separate lines. That means, not this:
    name: 'match' | 'name' | 'mode' | 'priority'
    but this:
    name: 'match'
        | 'name'
        | 'mode'
        | 'priority'
  2. Precede alternates with '|' rather than follow. Because eventually what you really want is:
    name: 'match'    { [$item[0],$item[1]] }
        | 'name'     { [$item[0],$item[1]] }
        | 'mode'     { [$item[0],$item[1]] }
        | 'priority' { [$item[0],$item[1]] }
  3. Use sub AUTOLOAD. Due to a less than careful reading of the camel book, I'd thought that this only worked in '.pm' files, i.e. as part of a package. But a little thinking reminded me that everything is a package, so this magical feature is available to anyone with the need. Here is what I use:
    sub AUTOLOAD {
        my $tree = shift;
        our $AUTOLOAD;
        print STDERR "@@@ $AUTOLOAD(\$tree) @@@\n";
        recurse($tree);
    }
    “But what's it good for?” you ask. Well, as the book points out, when a subroutine is undefined, AUTOLOAD is called with the same arguments that would have been passed to the original subroutine. This is useful in at least two cases: when you want to see what was sent to a given function, and when you haven't written the missing code yet. As an example, during the code and test cycle I'll make a change to some portion of the parse tree. If it is an addition, I'll hold off on writing the necessary code until I can confirm my 'expectations' about what the new code will be sent. If on the other hand a change causes a problem, then it is easy enough to rename the affected functions (I just prepend an underscore to the function name) and then analyze the resulting display of information. It is not a particularly sophisticated technique (a semi-modern version of the old IBM core dump), but I use it because it works, not because it's trendy!
  4. Use:
    • use strict;
    • no strict "refs";
    • use warnings;
    • use diagnostics;
    • use Parse::RecDescent;
    • use Data::Dumper;
    • use Carp::Assert;
    • use Carp;
    The first few are obvious. The last three provide a much needed tool kit as you work your way through your project. Given that the returned result of $data = $parser->startrule($test); is a complex data structure (a reference that refers to an array of arrays with literals along the way), Data::Dumper is your friend! The next 'use' (Carp::Assert) allows you to test your assumptions as you code, and the last (Carp) allows you to identify just how you got to where you crashed! Think defensively, act preemptively (old football slogan...)! Note for those who immediately criticize the less than complete use strict;, too bad... my code uses symbolic code refs, something I won't do without. (See sub startrule below.)
  5. Use print Dumper $whatever,"\n"; as needed. Just because you are lost in a maze of twisty little passages, doesn't mean you can't get a decent road map.
  6. In addition to Dumper, don't hesitate to roll your own tree walkers. Code reuse is good, but sometimes it isn't what you need. Thing to remember here is that the parser returns a parse tree, nothing too complicated, either the 'current thing' is an array or it's not. If not then do something with the data, otherwise move to the next level and repeat the process.
  7. It is nice to say “Use the source Luke!” but the truth is that Parse::RecDescent is pretty opaque. You are probably better off using the tried and true print statement instead.
  8. Use $::RD_AUTOACTION = q { [@item[0..$#item]] }; until you know what you are doing and then replace each default action with one tailored to your needs.
  9. $item[0] is your friend, keep it around. A crude version of compiling is just two steps: build the parse tree and then execute the tree! Think of it this way: with almost no effort at all, the first item in an array in the tree is the rule name-- why not also think of it as the name of the function that will process the tree at that point? Here is my 'standard' sub startrule; the line that depends on $item[0] is marked with a comment:
    sub startrule {
        my $tree = shift;
        foreach (@$tree) {
            if ( ref($_) eq 'ARRAY' ) {
                if ( ref( $_->[0] ) eq 'ARRAY' ) {
                    startrule($_);    # a list of nodes: recurse
                }
                else {
                    &{ $_->[0] }($_);    # depends on $item[0]: rule name used as sub name
                }
            }
        }
    }
    Like I said, keep it around, it's useful!
  10. Place trial input in the __DATA__ section. This allows regression testing, so use discretion when weeding it out.
  11. Refactor constantly-- often portions of the grammar will become obsolete and will need to be pruned. Because this is the case, the previous item becomes even more important. Even if you have a fairly complete design before you commit to code, the process of building and testing will suggest changes, and changes will result in a certain amount of obsolescence-- hence the need for pruning shears!
  12. Replace all regex character sets with a symbolic reference. Well that's not quite correct, replace all but the last instance! In other words, things like:
    style_option: 'version'                    { [@item[0..$#item]] }
        | 'xmlns:xsl'                  { [@item[0..$#item]] }
        | 'id'                         { [@item[0..$#item]] }
        | 'extension-element-prefixes' { [@item[0..$#item]] }
        | 'exclude-result-prefixes'    { [@item[0..$#item]] }
        | /[a-zA-Z0-9:_.\-]+/          { [@item[0..$#item]] }
    Becomes:
    style_option: 'version'                    { [@item[0..$#item]] }
        | 'xmlns:xsl'                  { [@item[0..$#item]] }
        | 'id'                         { [@item[0..$#item]] }
        | 'extension-element-prefixes' { [@item[0..$#item]] }
        | 'exclude-result-prefixes'    { [@item[0..$#item]] }
        | char_set                     { [@item[0..$#item]] }

    char_set: /[a-zA-Z0-9:_.\-]+/ { $item[1] }
    Think of this as yet another version of the 'No magic numbers' rule. Besides it's likely that you will want to use 'char_set' elsewhere and having a single point of definition makes later changes manageable!
  13. New sym-refs need not disturb the parse tree, you can 'stealth' them in-- i.e. { $item[1] }, in place of { [$item[0],$item[1]] }. (See previous example.)
  14. Parse trees usually don't need literals, remove them when you can. For instance, if you have something like this:
    preserve_space: 'preserve-space' '[' 'elements' '=' qstring ']' paren(?)
    You will do better to have an action like this:
    { [$item[0],$item[5]] }
    As you can see, 'qstring' is the only significant bit here so the returned value for this rule is rule-name followed by rule-value.
  15. If you use the 'special magic' to create a grammar class from the command line using:
    > perl -MParse::RecDescent - grammar Yet::Another::Grammar
    be aware that this might not work the same if the original (presumably in-line) grammar depended on $::RD_AUTOACTION = q { [@item[0..$#item]] }; or similar. The magic method ignores RD_AUTOACTION and uses its own default action as needed. The solution is to duplicate the auto action by hand in the grammar file-- this is not such a big deal since by then most actions will have already been customized.
  16. There is still no free lunch! Eventually you will come to a point where you need to parse either multi-line comments or something similar. Be aware that just because everyone said to use Parse::RecDescent doesn't mean that the answer is easy. It's not, you still need to bite the bullet and do the work. You may at first think that <perl_quotelike> is the way out. Do not be deceived! When the documentation says

    Parse::RecDescent provides limited support for parsing subsets of Perl, namely: quote-like operators, Perl variables, and complete code blocks.
    it is being literal. If your language is not 'Perl' then this shortcut will not get you to the church on time!

    Lest I be accused of talking around the problem, here is what I do to support multi-line comments:

    xcomment: <skip: qr/[ \t]*/> newline(0..) '<!--'
        {
            ($text,$return) = main::parse_delimited($text,'<!--','-->');
            $return = ['xcomment',$return];
        }
    Where the function used looks like:
    #______________________________________________________________________________
    use Text::DelimMatch;    # needed by parse_delimited

    sub parse_delimited {
        my $text       = shift;
        my $startdelim = shift;
        my $enddelim   = shift;
        my $mc = Text::DelimMatch->new( $startdelim, $enddelim );
        # Put back the start delimiter that the grammar already consumed
        my ( $p, $m, $r ) = $mc->match( $startdelim . $text );
        if ($p) { $text = $p; }
        else    { $text = ""; }
        $text .= $r if ($r);
        # Strip the delimiters from the matched comment body
        $m =~ s/^$startdelim//;
        $m =~ s/$enddelim$//;
        return $text, $m;
    }
    #______________________________________________________________________________
    It is not a perfect solution, as the documentation says,
    Modifying the value of the variable $text may confuse the column counting mechanism
    but other than that it does have the virtue of 'working'!
  17. Greediest production in a rule goes last. For instance given the following:
    startrule: is_printable(s) {[$item[0],$item[1]]}
        | name {[$item[0],$item[1]]}

    is_printable: <skip: ''> /[[:print:]]+/ { [$item[0],$item[2]] }

    name: 'match'    { [@item[0..$#item]] }
        | 'name'     { [@item[0..$#item]] }
        | 'mode'     { [@item[0..$#item]] }
        | 'priority' { $item[1] }
    You are never going to get to 'name', because 'is_printable' will consume all of the characters in any given 'name', do not pass Go, do not etc. Further, correct ordering from least greedy to most, allows the last sub-rule to act as a backstop for the rule in general.
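
For what it's worth, here is a minimal sketch of tips 6 and 9 taken together: a home-grown walker that uses the rule name at the front of each node to pick the code that handles it. It assumes a tree shaped like the one the autoaction in tip 8 builds (each node is [rule_name, values...]). Note that it swaps the symbolic code refs of sub startrule for a dispatch hash so that it runs under full strict. The %handler table, the walk function and the hand-built $tree are all made up for illustration; none of them come from Parse::RecDescent.

```perl
use strict;
use warnings;

# Hypothetical handlers, one per rule name (illustrative names only).
my %handler = (
    name     => sub { my $node = shift; return "name=$node->[1]" },
    char_set => sub { my $node = shift; return "text=$node->[1]" },
);

# Walk a tree of [rule_name, values...] nodes and collect handler results.
# Either the 'current thing' is an array or it's not.
sub walk {
    my ( $node, $out ) = @_;
    return $out unless ref($node) eq 'ARRAY';
    my $head = $node->[0];
    if ( defined $head && !ref($head) && exists $handler{$head} ) {
        # The first element is a rule name we know how to process.
        push @$out, $handler{$head}->($node);
    }
    else {
        # Unknown rule name or a plain list of nodes: go one level down.
        walk( $_, $out ) for @$node;
    }
    return $out;
}

# A hand-built tree standing in for the result of $parser->startrule($text).
my $tree = [ 'startrule', [ [ 'name', 'match' ], [ 'char_set', 'foo:bar' ] ] ];

my $results = walk( $tree, [] );
print "$_\n" for @$results;
```

Rule names without a handler (here 'startrule') are simply descended into, which makes it easy to add handlers one at a time as the grammar grows.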

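The AUTOLOAD trick from tip 3 can be tried out on its own. The sketch below assumes nothing beyond core Perl: the subs name and mode, the @trace array and the sample nodes are invented for the demonstration, and the DESTROY guard is an extra precaution that is not in the version shown in tip 3.

```perl
use strict;
use warnings;

our $AUTOLOAD;
my @trace;    # calls that fell through to AUTOLOAD

# Catch-all for handlers that are missing (or renamed out of the way):
# record the name and what it was sent, then carry on.
sub AUTOLOAD {
    my @args = @_;
    return if $AUTOLOAD =~ /::DESTROY$/;    # never trap object destruction
    my $shown = join ', ', map { ref($_) || $_ } @args;
    push @trace, "$AUTOLOAD($shown)";
    print STDERR "\@\@\@ $trace[-1] \@\@\@\n";
    return;
}

# A handler that has been written.
sub name { my $node = shift; return "handled $node->[1]" }

# A handler renamed with a leading underscore, so 'mode' is now undefined.
sub _mode { my $node = shift; return "handled $node->[1]" }

my $got = name( [ 'name', 'match' ] );    # calls the real sub
mode( [ 'mode', 'fast' ] );               # undefined, so AUTOLOAD reports it
```

Calling the undefined sub with parentheses compiles fine under strict; the lookup only happens at run time, which is when AUTOLOAD steps in.
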
–hsm

"Never try to teach a pig to sing…it wastes your time and it annoys the pig."