Re: Parse syntactically analyzed sentence

by graff (Chancellor)
on May 08, 2016 at 01:15 UTC (#1162452)

in reply to Parse syntactically analyzed sentence

It has been a very long time since I last did anything with Parse::RecDescent, so it took me longer than I'd like to admit to come up with a grammar that works. (It's especially humbling in this case, because I recognize, and have often played with, the sort of data you've got here: Penn Treebank.)

So here's a grammar that does what you seem to want:

    start: tree
    tree: '(' treestr(s) ')'
    treestr: tree | tagstr
    tagstr: TAG ( tree | word )
    TAG: /[A-Z.]+ /
    word: /[\w?]+/

Note the "(s)" modifier on the first mention of the "treestr" rule -- the start contains one tree (one set of parens will bound the entire string), but within that one tree you can find one or more subtrees. The OP grammar stopped at the end of the first subtree because it couldn't handle the sister tree that followed it.

There's probably something I'm not understanding just now about using parens (for grouping) and vertical bars (for alternations) in the grammar spec, and it's likely that there are other (less cumbersome) ways to define the grammar for data of this type.

Anyway, the grammar above does work its way to the end of your test string (though perhaps you want a different sort of data structure as the result, in which case, I apologize -- good luck with that).

I also noticed from the P::RD man page that you can pass a reference to a scalar containing the string to be parsed. Portions of the string will be removed as the parser works through it, so if you get back less of a structure than you expect, you can look at the string to see where the parsing stopped (due to failure to match any rules). Here's my version of your code:

    #!/usr/bin/perl
    use strict;
    use warnings;
    use Parse::RecDescent;
    use Data::Dumper;

    # Wrap every matched production in an array ref of its items
    $::RD_AUTOACTION = q { [@item] };
    # $::RD_HINT = 1;

    my $grammar = q {
        start: tree
        tree: '(' treestr(s) ')'
        treestr: tree | tagstr
        tagstr: TAG ( tree | word )
        TAG: /[A-Z.]+ /
        word: /[\w?!.]+/
    };

    my $parser = Parse::RecDescent->new($grammar);

    my $text = "(SBARQ (WHNP (WP What))(SQ (VBZ is)(NP (NNP Head)(NNP Start)))(. ?))";

    # Pass a reference so that $text ends up holding whatever was left unparsed
    my $result = $parser->start( \$text );
    print $text, "\n";
    print Dumper($result);

(UPDATE: I have the "HINT" setting commented out because it wasn't all that helpful.)
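
To make the two ideas above concrete (repeating "treestr" to pick up sister subtrees, and consuming the text through a reference so the leftover shows where parsing stopped), here is a minimal hand-written sketch in core Perl. It is not Parse::RecDescent and not what the module generates internally; the sub names (parse_tree, parse_treestr, parse_tagstr) and the returned array layout are my own assumptions for illustration. It only mirrors the shape of the grammar.

```perl
use strict;
use warnings;

# Hand-written stand-in for the grammar (NOT Parse::RecDescent).
# Every sub takes a reference to the remaining text, works on a copy,
# and writes the shortened copy back only when its rule matched.

sub parse_tree {                       # tree: '(' treestr(s) ')'
    my ($ref) = @_;
    my $text = $$ref;
    $text =~ s/^\s*\(// or return;
    my @kids;
    while ( defined( my $kid = parse_treestr( \$text ) ) ) {
        push @kids, $kid;              # the "(s)": keep taking sister subtrees
    }
    return unless @kids;
    $text =~ s/^\s*\)// or return;
    $$ref = $text;                     # commit: caller sees only the leftover
    return \@kids;
}

sub parse_treestr {                    # treestr: tree | tagstr
    my ($ref) = @_;
    my $node = parse_tree($ref);
    return defined $node ? $node : parse_tagstr($ref);
}

sub parse_tagstr {                     # tagstr: TAG ( tree | word )
    my ($ref) = @_;
    my $text = $$ref;
    $text =~ s/^\s*([A-Z.]+) // or return;   # TAG plus its trailing space
    my $tag  = $1;
    my $body = parse_tree( \$text );
    if ( !defined $body ) {
        $text =~ s/^\s*([\w?!.]+)// or return;   # word
        $body = $1;
    }
    $$ref = $text;
    return [ $tag, $body ];
}

# One complete tree followed by text that the start rule never reaches.
my $text   = "(NP (NNP Head)(NNP Start)) (VP (VBZ is))";
my $result = parse_tree( \$text );
print "subtrees found: ", scalar @$result, "\n";
print "left unparsed: '$text'\n";
```

Without the loop in parse_tree, the second "(NNP ...)" subtree would never be consumed, which is the same kind of early stop described for the OP grammar. Whatever remains in $text afterwards marks the point where matching ended.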

Another update: you probably would have figured this out, but the ". ?" string really should be treated as a "TAG word" pair, which is what my version of the grammar does. The "." is a generic "TAG" label for (strings of?) punctuation, and the "?" in this case represents the actual token that occurred in the text. Other sentences, ending with other punctuation marks, would have ". ." or ". !", etc. The rule for TAG also absorbs the space that must follow the TAG token.

Added "!." to the rule for "word" in the script above; you might need to add more punctuation once you start getting into more varied sentences.
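
A quick way to see which "TAG word" leaves the two token patterns accept is to try them outside the parser. The sketch below uses plain Perl regexes copied from the grammar; the helper name split_leaf and the sample leaves are assumptions of mine, not part of the original code. The comma leaf is included because the Penn Treebank tags a comma as ",", which the TAG pattern as written does not cover.

```perl
use strict;
use warnings;

# The two token patterns from the grammar, compiled for stand-alone use.
my $TAG  = qr/[A-Z.]+ /;     # tag plus the space that must follow it
my $WORD = qr/[\w?!.]+/;     # word characters plus the punctuation added so far

# Split a "TAG word" leaf such as ". ?" into its two parts.
# Returns an empty list when the leaf does not fit the patterns.
sub split_leaf {
    my ($leaf) = @_;
    return $leaf =~ /^($TAG)($WORD)$/ ? ( $1, $2 ) : ();
}

for my $leaf ( '. ?', '. .', '. !', 'NNP Start', ', ,' ) {
    my @parts = split_leaf($leaf);
    printf "%-12s %s\n", "'$leaf'",
        @parts ? "TAG='$parts[0]' word='$parts[1]'" : 'no match';
}
```

Any leaf reported as "no match" here points at a character that would have to be added to the TAG or word character class before the grammar can handle sentences containing it.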

Re^2: Parse syntactically analyzed sentence
by nido203 (Novice) on May 08, 2016 at 09:00 UTC

    Wow, awesome! This is exactly what I was looking for. Thank you very much, sir. Yes, it occurred to me that ". ?" should be treated as a "TAG word" pair, but I couldn't make it work somehow.
