
Some tips (from a beginner) for using Parse::RecDescent:

I'll start with a simple example. Suppose you want to parse lines of text that look like this:

    employee Joe 10

Here is another example of such a line:

    employee Cathy 14

A line of text consists of the literal 'employee' followed by a name and an id. To parse the text, first you define a rule such as employee_info:

    my $grammar = <<'END_OF_GRAMMAR';

    employee_info: 'employee' name id

    name: m{ \S+ }xms
    id: m{ \d+ }xms

    END_OF_GRAMMAR

Then you parse the text:

    my $text = "employee Joe 10";

    my $parser = Parse::RecDescent->new($grammar)
        or die "Bad grammar!\n";

    defined $parser->employee_info($text)
        or die "Text doesn't match";

But if you run that in a fully fleshed out program (which you'll see soon enough), it will produce a big fat nothing for output. Yet, because neither of the die() error messages was displayed, you know that your grammar didn't have any errors and that the text matched.

To actually produce some output, you need to add an Action. An Action is executed when the parser finds a match for the rule. Here is what an Action looks like:

    my $grammar = <<'END_OF_GRAMMAR';

    employee_info: 'employee' name id
        { print "$_\n" for @item; }   #Action

    name: m{ \S+ }xms
    id: m{ \d+ }xms

    END_OF_GRAMMAR

The array @item is provided by Parse::RecDescent, and it contains the text that matches the rule. Here is a complete sample program using the employee_info rule and its associated action followed by the output:

    use strict;
    use warnings;
    use 5.012;

    use Parse::RecDescent;

    $::RD_ERRORS = 1;     #Report fatal errors
    $::RD_WARN   = 1;     #Enable warnings - warn on unused rules &c.
    $::RD_HINT   = 1;     #Give out hints to help fix problems.
    #$::RD_TRACE = 1;     #Trace parsers' behaviour

    my $grammar = <<'END_OF_GRAMMAR';

    employee_info: 'employee' name id
        { print "$_\n" for @item; }   #Action

    name: m{ \S+ }xms
    id: m{ \d+ }xms

    END_OF_GRAMMAR

    my $text = "employee Joe 10";

    my $parser = Parse::RecDescent->new($grammar)
        or die "Bad grammar!\n";

    defined $parser->employee_info($text)
        or die "Text doesn't match";

    --output:--
    employee_info
    employee
    Joe
    10

Note that @item contains the rule name at index position 0.

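The parser does not split your text into tokens up front. Instead, it works through the text from left to right, skipping any leading whitespace before each terminal (a literal or a regex), and checks whether what comes next matches the next term in your rule. So 'employee', 'Joe', and '10' are matched one after another.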
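One way to check how whitespace is treated: as far as I understand the Parse::RecDescent docs, the parser skips optional whitespace before each terminal by default, so the amount (and kind) of whitespace between the terms should not matter. Here is a minimal sketch of that, using the employee_info grammar from above with an action that hands back the matches via $return:

```perl
use strict;
use warnings;
use Parse::RecDescent;

my $grammar = <<'END_OF_GRAMMAR';
employee_info: 'employee' name id
    { $return = [ @item[1..3] ]; }

name: m{ \S+ }xms
id:   m{ \d+ }xms
END_OF_GRAMMAR

my $parser = Parse::RecDescent->new($grammar) or die "Bad grammar!\n";

# Several spaces, a newline and a tab between the terms:
my $result = $parser->employee_info("employee    Joe\n\t10");
defined $result or die "Text doesn't match";

print join('|', @$result), "\n";
```

That should print employee|Joe|10, even though the input has several spaces, a newline and a tab between the terms.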

1) The matches are in @item and %item.

When a rule matches, the @item array contains the rule name at index position 0, with successive index positions containing the text that matched each term in the rule. Similarly, %item is a hash where the keys are the term names and the values are the matched text. However, term names can get complex (see tip #7), so often it is easier to use @item to get the matches, e.g. $item[-1].

However, things aren't always so straightforward. Suppose you want to parse text like this:

    { hello }

So you come up with this grammar:

    myrule: brace_clause

    brace_clause: '{' word '}'

    word: m{ [a-z]+ }xms

To see the matches for the brace_clause rule, you might add an action like this:

    myrule: brace_clause

    brace_clause: '{' word '}'
        { print "$_\n" for @item; }

    word: m{ [a-z]+ }xms

That would produce this output:

    brace_clause   #the rule name
    {              #the match for '{'
    hello          #the match for word
    }              #the match for '}'

Okay, no surprises there. But what if you move the action so that it is under myrule, like this:

    myrule: brace_clause
        { print "$_\n" for @item; }

    brace_clause: '{' word '}'

    word: m{ [a-z]+ }xms

What do you expect the output to be now? Maybe this:

    myrule
    { hello }

The actual output is:

    myrule
    }

What the? It turns out that when a rule has no action of its own, the value it hands to the rule that uses it as a subrule is the match for its last term, which in this case is the literal '}'. Yes, that effect will make you tear your hair out at some point.

In order to send along the entirety of the matched text to another rule, you'll need to retrieve all the matches from @item and join() them into a string:

    myrule: brace_clause
        { print "$_\n" for @item; }

    brace_clause: '{' word '}'
        { join ' ', @item[1..3] }

    word: m{ [a-z]+ }xms

    --output:--
    myrule
    { hello }

Just remember that often one of the elements in @item will be a reference to an array of matches, so joining all the matches together may take several lines of code. As always, use Data::Dumper to display the structure of @item so that you can figure out how to retrieve all the matches.
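To illustrate the array reference case, here is a minimal sketch in which the middle term is word(s), so $item[2] is a reference to an array of words and has to be dereferenced before joining. The grammar is a variation on the brace_clause example, not something from a real project:

```perl
use strict;
use warnings;
use Parse::RecDescent;

my $grammar = <<'END_OF_GRAMMAR';
brace_clause: '{' word(s) '}'
    { join ' ', $item[1], @{ $item[2] }, $item[3] }

word: m{ [a-z]+ }xms
END_OF_GRAMMAR

my $parser = Parse::RecDescent->new($grammar) or die "Bad grammar!\n";

# $item[2] holds ['hello', 'world'], so it is flattened inside the join
my $joined = $parser->brace_clause("{ hello world }");
defined $joined or die "Text doesn't match";

print "$joined\n";
```

The value handed back by brace_clause should then be the string { hello world } rather than just the closing brace.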

2) Add a Start-up action to your grammar to use Data::Dumper.

The parser executes in a different namespace than your program, so the parser can't see any use statements at the top of your program. As a result, if you need to use a module inside several actions you can put a use statement in a Start-up action:

    my $grammar = <<'END_OF_GRAMMAR';

    #Start up action(executed in parser namespace):
    {
      use 5.012;   #enable say()
      use Data::Dumper;
    }

    …
    …

    END_OF_GRAMMAR

With that Start-up action, you can call the functions defined in Data::Dumper to display @item in any action. Using Data::Dumper will allow you to see exactly what form the matches are in (a string? an array of strings? an array of arrays of strings?). I suggest Dumping @item as the first line in an action and not writing any additional code in the action until you examine the output:

    my $grammar = <<'END_OF_GRAMMAR';

    #Start up action(executed in parser namespace):
    {
      use 5.012;   #enable say()
      use Data::Dumper;
    }

    employee_info: 'employee' name id
        { say Dumper(\@item); }   #Action

    name: m{ \S+ }xms
    id: m{ \d+ }xms

    END_OF_GRAMMAR

    my $text = "employee Joe 10";

    my $parser = Parse::RecDescent->new($grammar)
        or die "Bad grammar!\n";

    defined $parser->employee_info($text)
        or die "Text doesn't match";

    --output:--
    $VAR1 = [
              'employee_info',
              'employee',
              'Joe',
              '10'
            ];

Once you see the exact layout of the matches in @item, it is much easier to figure out the correct syntax for retrieving the information you want.

3) Actions change the matched text.

The return value of an Action is the value of the last expression executed inside the action. Furthermore, the return value of the action becomes the substitute for the text that actually matched the rule. That effect rears its ugly head when one rule incorporates another rule:

    use strict;
    use warnings;
    use 5.012;

    use Parse::RecDescent;

    $::RD_ERRORS = 1;     #Report fatal errors
    $::RD_WARN   = 1;     #Enable warnings - warn on unused rules &c.
    $::RD_HINT   = 1;     #Give out hints to help fix problems.
    #$::RD_TRACE = 1;     #Trace parsers' behaviour

    my $grammar = <<'END_OF_GRAMMAR';

    {
      use Data::Dumper;
      use 5.012;   #enable say()
    }

    another_rule: 'new' employee_info
        { say Dumper(\@item); }

    employee_info: 'employee' name id
        {
          say Dumper(\@item);
          say '-' x 20;
          'hello world';
        }

    name: m{ \S+ }xms
    id: m{ \d+ }xms

    END_OF_GRAMMAR

    my $text = "new employee Joe 10";

    my $parser = Parse::RecDescent->new($grammar)
        or die "Bad grammar!\n";

    defined $parser->another_rule($text)
        or die "Text doesn't match";

    --output:--
    $VAR1 = [
              'employee_info',
              'employee',
              'Joe',
              '10'
            ];

    --------------------
    $VAR1 = [
              'another_rule',
              'new',
              'hello world'
            ];

See how the matched text inside the employee_info rule's action was 'employee', 'Joe', '10', but in another_rule, which contains the term employee_info, the matched text for employee_info has changed to 'hello world'?

A common problem is seeing 1 displayed as the match for part of a rule. You need to remember that print() and say() return 1 on success, so if either of those statements is the last thing executed in an action, the action will return 1 as the matched text to another rule. When you are tearing your hair out, come back to this tip and re-read it, then write down what the actions in your grammar return and stare at the values for a while; then see if those values appear anywhere in your output.
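Here is a small sketch of that pitfall and one way around it. The rule names noisy and quiet are made up for the illustration; the only difference between them is the last expression in the action:

```perl
use strict;
use warnings;
use Parse::RecDescent;

my $grammar = <<'END_OF_GRAMMAR';
# print is the last statement, so the action returns 1
noisy: word
    { print "noisy saw: $item[1]\n" }

# the last expression is the matched word, so that is what gets returned
quiet: word
    { print "quiet saw: $item[1]\n"; $item[1] }

word: m{ \S+ }xms
END_OF_GRAMMAR

my $parser = Parse::RecDescent->new($grammar) or die "Bad grammar!\n";

my $noisy_value = $parser->noisy("hello");
my $quiet_value = $parser->quiet("hello");

print "noisy returned: $noisy_value\n";
print "quiet returned: $quiet_value\n";
```

The noisy rule should come back as 1 and the quiet rule as hello, so ending an action with the value you actually want to pass along is an easy habit to get into.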

It's also possible to insert an action in the middle of a rule--rather than at the end. For instance, you can do this:

    employee_info: 'employee' name
        { say $item[1]; }
        id

    name: m{ \S+ }xms
    id: m{ \d+ }xms

But there is a side effect of doing that: the return value of the action is inserted into @item at the position where the action appears, i.e. just after the match for the term that precedes it. That can cause problems if you try to do something like this:

    employee_info: 'employee' name
        { say $item[1]; }
        id
        { say $item[3] }   #print match for id

The match for id is not at position 3 in @item--the match is at position 4 because the first action inserted something in @item. Once again, if you get strange errors when trying to retrieve or print out matches, you should use Data::Dumper to display @item to see exactly where a match is located in @item.
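The shift in positions can be made visible with a short sketch. The mid-rule action below returns the made-up string 'inserted' so that it is easy to spot, and the final action reports the size of @item, positions 3 and 4, and the named lookup $item{id}, which does not depend on position:

```perl
use strict;
use warnings;
use Parse::RecDescent;

my $grammar = <<'END_OF_GRAMMAR';
employee_info: 'employee' name
    { 'inserted' }
    id
    { $return = [ scalar(@item), $item[3], $item[4], $item{id} ]; }

name: m{ \S+ }xms
id:   m{ \d+ }xms
END_OF_GRAMMAR

my $parser = Parse::RecDescent->new($grammar) or die "Bad grammar!\n";

my $report = $parser->employee_info("employee Joe 10");
defined $report or die "Text doesn't match";

my ($count, $third, $fourth, $by_name) = @$report;
print "items: $count, [3]: $third, [4]: $fourth, {id}: $by_name\n";
```

With the extra action in the rule, @item should have 5 elements, position 3 should hold 'inserted', and the id should turn up at position 4 as well as under $item{id}.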

4) Some comments cause errors.

The comment at the end of the line in this action is benign:

    my $grammar = <<'END_OF_GRAMMAR';

    #Start up action(executed in parser namespace):
    {
      use 5.012;   #So I can use say()
    }
    ...

...but compressing that action into one line:

    my $grammar = <<'END_OF_GRAMMAR';

    #Start up action(executed in parser namespace):
    { use 5.012; #So I can use say() }
    ...

...causes this error:

    Unknown starting rule (Parse::RecDescent::namespace000001::startrule) called at 3.pl line 76.

Presumably the comment runs to the end of the line and takes the closing brace of the action with it. As a result, I recommend that you not use trailing comments.

5) Why do I keep getting the errors:

  1. "Unknown starting rule (Parse::RecDescent::namespace000001::non_rule) called"
  2. "Text doesn't match"

a) Don't forget to change the line:

    defined $parser->another_rule($text) or die "Text doesn't match";

...to reflect the new rule name when you start adding or changing rule names. Or, it could be a comment causing the error (see tip #4).

b) Similarly, make sure you update your $text string when testing a new rule.

6) Parsing delimited lists.

If you have text like this:

    hello,world,goodbye,mars

You can parse it with this rule:

    word_list: word(s /,/)

    word : m{ [^,]+ }xms

Here's how that works. A subrule such as word(s) will match one or more words, so if you have this grammar:

 
    word_list:  word(s)
                { 
                  say Dumper(\@item);
                }

    word: m{ \S+ }xms

...it will match text like this:

  hello
  hello world
  hello world goodbye mars

In addition, the syntax word(s) allows you to specify a regex as the separator between the words, e.g. word(s /,/). So if you have grammar like this:

    word_list: word(s /,/)
        { say Dumper(\@item); }

    word : m{ [^,]+ }xms

…then word(s /,/) will match text like this:


   hello
   hello,world
   hello,world,goodbye,mars

Here is some sample Data::Dumper output:

    $VAR1 = [
              'word_list',
              [
                'hello',
                'world',
                'goodbye',
                'mars'
              ]
            ];

You have to be careful about how you define the word rule because word(s) can also match a single word, and the parser will greedily eat up as much text as it can trying to match a single word. As a result, note how the regex for word changed when parsing the comma separated list.
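The difference between the two word regexes can be seen side by side in the following sketch. The rule names greedy_list, greedy_word, comma_list and comma_word are made up for the comparison; both lists use the same /,/ separator and differ only in the regex for a word:

```perl
use strict;
use warnings;
use Parse::RecDescent;

my $grammar = <<'END_OF_GRAMMAR';
# \S+ does not stop at commas, so one "word" swallows the whole line
greedy_list: greedy_word(s /,/)
    { $return = $item[1]; }
greedy_word: m{ \S+ }xms

# [^,]+ stops at each comma, so the separator gets a chance to match
comma_list: comma_word(s /,/)
    { $return = $item[1]; }
comma_word: m{ [^,]+ }xms
END_OF_GRAMMAR

my $parser = Parse::RecDescent->new($grammar) or die "Bad grammar!\n";

my $text   = "hello,world,goodbye";
my $greedy = $parser->greedy_list($text);
my $split  = $parser->comma_list($text);

print scalar(@$greedy), " word(s): @$greedy\n";
print scalar(@$split),  " word(s): @$split\n";
```

The greedy version should report a single word containing the commas, while the [^,]+ version should report three separate words.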

A delimited list can also look like this:

    apple or strawberry or cherry

Instead of being delimited by a comma, the words are delimited by 'or'. That delimited list is even easier to parse:

    my $grammar = <<'END_OF_GRAMMAR';

    {
      use Data::Dumper;
      use 5.012;   #enable say()
    }

    delimited_or_list: word(s /or/)
        { say Dumper(\@item); }

    word: m{ \S+ }xms

    END_OF_GRAMMAR

    my $text = "apple or strawberry or cherry";

    my $parser = Parse::RecDescent->new($grammar)
        or die "Bad grammar!\n";

    defined $parser->delimited_or_list($text)
        or die "Text doesn't match";

    --output:--
    $VAR1 = [
              'delimited_or_list',
              [
                'apple',
                'strawberry',
                'cherry'
              ]
            ];

Unlike in the previous example, this time you don't have to worry about the delimiter when constructing the regex for the word rule. The reason is whitespace: by default the parser skips leading whitespace before each terminal, and \S+ stops at the first whitespace character. In the previous example there is no whitespace in "hello,world,goodbye,mars", so \S+ would swallow the whole string, commas included. Here each word is followed by a space, so word matches 'apple', then the separator /or/ matches, then word matches 'strawberry', and so on.

7) The keys in %item might not be what you think.

If you have the following rules:

    some_rule_name: 'hello' word(s /,/)

    word : m{ [^,]+ }xms

…then the key in the %item hash for the text that matched word(s /,/) is the unwieldy key 'word(s /,/)'--not the key 'word'. So you could grab the match by writing:

$item{'word(s /,/)'}

...but that is difficult to type and it looks like hell, so I find it easier and clearer to use the @item array instead and write:

$item[-1]

8) Parsing quoted strings.

If you have text like this:
    commands-> 'go to' 'stop' 
    commands-> 'next' 'back up' 

…and you want to get the text inside the quotes, you can use this grammar:

    cmd_choices: 'commands->' quoted_string(s)    
                 { say Dumper(\@item); }
                  
    quoted_string: <perl_quotelike>

<perl_quotelike> is a predefined directive which handles parsing interior quotes inside the text (it actually matches any perl "quote-like operator", see the Parse::RecDescent docs). Remember that actions return the value of the last expression executed in the action, and the action's return value is considered to be the text that matched the rule. The same goes for the directive: whatever <perl_quotelike> returns is considered the matching text for each quoted_string in the cmd_choices rule. Here is some sample Data::Dumper output:

    $VAR1 = [
              'cmd_choices',
              'commands->',
              [
                [
                  '',
                  '\'',
                  'go to',
                  '\'',
                  '',
                  '',
                  '',
                  ''
                ],
                [
                  '',
                  '\'',
                  'stop',
                  '\'',
                  '',
                  '',
                  '',
                  ''
                ]
              ]
            ];

Whoa. What is that mess? Each <perl_quotelike> match returns a reference to an 8 element array containing information about the quoted string that matched, which is mostly blank for a string quoted with " or '. Because the rule uses quoted_string(s), those arrays are collected into an array of arrays. At index position 2 in the sub arrays is the text that was inside the quotes, and at index positions 1 and 3 are the actual quote marks that were found. If you want the text that was inside the quote marks, you just have to figure out the right syntax in order to grab the values at index position 2 in each array (explanation after the code):

    cmd_choices: 'commands->' quoted_string(s)
        {
          my @results = map { $_->[2] } @{$item[-1]};
          say for @results;
        }

    quoted_string : <perl_quotelike>

    --output:--
    go to
    stop

The last item in @item, $item[-1], is whatever matched the last subrule. The last subrule in cmd_choices is quoted_string(s), and from the Data::Dumper output you can see that $item[-1] is a reference to an array of arrays. If you dereference $item[-1], @{$item[-1]}, you get an array where the items, $_ , are references to arrays. Each array reference is a reference to an array where index position 2, $_->[2], contains the text inside the quotes.
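If the dereferencing is the confusing part, it can be tried out in plain Perl without the parser. The sketch below builds an @item by hand with the same shape as the Data::Dumper output above (the array itself is typed in here, not produced by Parse::RecDescent) and applies the same map:

```perl
use strict;
use warnings;

# Hand-built copy of the structure shown in the Data::Dumper output
my @item = (
    'cmd_choices',
    'commands->',
    [
        [ '', "'", 'go to', "'", '', '', '', '' ],
        [ '', "'", 'stop',  "'", '', '', '', '' ],
    ],
);

# $item[-1] is a reference to an array of array references;
# index 2 of each inner array is the text between the quotes.
my @results = map { $_->[2] } @{ $item[-1] };

print "$_\n" for @results;
```

This prints go to and stop on separate lines, the same as the action in the grammar.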

9) Saving the matched data as the parser moves along the text.

I read a tutorial that saves matches (or anything else) like this:

    use strict;
    use warnings;
    use 5.012;

    use Parse::RecDescent;

    $::RD_ERRORS = 1;     #Report fatal errors
    $::RD_WARN   = 1;     #Enable warnings - warn on unused rules &c.
    $::RD_HINT   = 1;     #Give out hints to help fix problems.
    #$::RD_TRACE = 1;     #Trace parsers' behaviour

    our %RESULTS;   #Need to declare the variable, but my variables
                    #won't be seen in another namespace. So declare
                    #a global variable with our().

    my $grammar = <<'END_OF_GRAMMAR';

    #Start up action(executed in parser namespace):
    {
      use 5.012;   #enable say()
      use Data::Dumper;
    }

    some_rule: name id
        {
          $main::RESULTS{names} = $item{name};
          $main::RESULTS{phone_numbers} = $item{id};
        }

    name: m{ \S+ }xms
    id: m{ \d+ }xms

    END_OF_GRAMMAR

    my $text = "Joe 10";

    my $parser = Parse::RecDescent->new($grammar)
        or die "Bad grammar!\n";

    defined $parser->some_rule($text)
        or die "Text doesn't match";

    use Data::Dumper;
    say Dumper(\%RESULTS);   #Do something with %RESULTS

    --output:--
    $VAR1 = {
              'names' => 'Joe',
              'phone_numbers' => '10'
            };
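A possible alternative to the package global, which is not what that tutorial does: the method call on the parser returns the value of the start rule, so an action can build a data structure and hand it back through $return. A minimal sketch, with made-up hash keys name and id:

```perl
use strict;
use warnings;
use Parse::RecDescent;

my $grammar = <<'END_OF_GRAMMAR';
some_rule: name id
    { $return = { name => $item{name}, id => $item{id} }; }

name: m{ \S+ }xms
id:   m{ \d+ }xms
END_OF_GRAMMAR

my $parser = Parse::RecDescent->new($grammar) or die "Bad grammar!\n";

# The hash reference assigned to $return comes back from the method call
my $record = $parser->some_rule("Joe 10");
defined $record or die "Text doesn't match";

print "$record->{name} has id $record->{id}\n";
```

That should print Joe has id 10, and nothing outside the parser's namespace has to be declared with our().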

10) Backreferences.

Suppose you have some text like this:

{{ hello }}

But the text can have a variable number of opening braces, say n opening braces, followed by 'hello', followed by n closing braces. The problem is that in order to match the closing braces, you need to know how many opening braces matched. Because you are able to interpolate variables into literal strings or regular expressions in your rules, you can construct a backreference like this:

    my $text = <<'END_OF_TEXT';
    {{ hello }}
    END_OF_TEXT

    my $grammar = <<'END_OF_GRAMMAR';

    {
      use 5.012;
      use Data::Dumper;
    }

    #Declare some my() variables for use within the rule:
    brace_block: <rulevar: ($lbraces, $rbraces)>

    brace_block: lbrace(1..)
        {
          $lbraces = join '', @{$item[1]};
          $rbraces = '}' x length $lbraces;
        }
        'hello' "$rbraces"
        {
          say "$lbraces $item[3] $rbraces";
        }

    lbrace: / [{] /xms

    END_OF_GRAMMAR

Or, perhaps this is cleaner:

    my $text = <<'END_OF_TEXT';
    {{ hello }}
    END_OF_TEXT

    my $grammar = <<'END_OF_GRAMMAR';

    {
      use 5.012;
      use Data::Dumper;

      my $lbrace_count;   #**DECLARE VARIABLE**
    }

    brace_block: lbrace(1..)
        {
          $lbrace_count = @{$item[1]};   #**SET VARIABLE**
        }
        'hello' rbraces
        {
          say "rbraces matched: $item{rbraces}";
        }

    lbrace: / [{] /xms
    rbraces: / [}]{$lbrace_count} /xms   #**INTERPOLATE VARIABLE**

    END_OF_GRAMMAR

In reply to Re: Why won't this basic Parse::RecDescent example work? by 7stud
in thread Why won't this basic Parse::RecDescent example work? by 7stud
