Re: Parse syntactically analyzed sentence

It has been a very long time since I last did anything with Parse::RecDescent, so it took me longer than I'd like to admit to come up with a grammar that works. (It's especially humbling for me in this case, because I recognize, and have often played with, the sort of data you've got here: Penn Treebank.)

So here's a grammar that does what you seem to want:

start: tree
tree: '(' treestr(s) ')'
treestr: tree | tagstr
tagstr: TAG ( tree | word )
TAG: /[A-Z.]+ /
word: /[\w?]+/

Note the "(s)" modifier on the first mention of the "treestr" rule -- the start contains one tree (one set of parens will bound the entire string), but within that one tree you can find one or more subtrees. The OP grammar stopped at the end of the first subtree because it couldn't handle the sister tree that followed it.
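To see the effect of "(s)" in isolation, here is a small sketch that builds two parsers differing only in that modifier and runs both on the test sentence. The variable names ($shared, $one_or_more, $exactly_one) are mine, and the second grammar is only a stand-in for "a grammar without the repetition", not a reconstruction of the OP's actual grammar. I'd expect the first to return a result and the second to give up at the sister tree and return undef.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Parse::RecDescent;

# Rules common to both parsers (copied from the grammar above).
my $shared = q{
treestr: tree | tagstr
tagstr: TAG ( tree | word )
TAG: /[A-Z.]+ /
word: /[\w?]+/
};

# The only difference between the two grammars is the "(s)" on treestr.
my $one_or_more = Parse::RecDescent->new(
    "start: tree\ntree: '(' treestr(s) ')'\n" . $shared );
my $exactly_one = Parse::RecDescent->new(
    "start: tree\ntree: '(' treestr ')'\n" . $shared );

my $text = "(SBARQ (WHNP (WP What))(SQ (VBZ is)(NP (NNP Head)(NNP Start)))(. ?))";

for my $case ( [ 'treestr(s)', $one_or_more ], [ 'treestr', $exactly_one ] ) {
    my ( $label, $parser ) = @$case;
    # Passing the string by value, so $text is left untouched between runs.
    my $result = $parser->start($text);
    printf "%-11s %s\n", $label, defined $result ? 'parsed' : 'no parse';
}
```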

There's probably something I'm not understanding just now about using parens (for grouping) and vertical bars (for alternations) in the grammar spec, and it's likely that there are other (less cumbersome) ways to define the grammar for data of this type.

Anyway, the grammar above does work its way to the end of your test string (though perhaps you want a different sort of data structure as the result, in which case, I apologize -- good luck with that).
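If what you're after is a plainer structure, one alternative is to skip Parse::RecDescent for this step and walk the brackets by hand. The following is only a sketch of that idea using core Perl regexes with \G and /gc. The subs parse_tree and parse_node are hypothetical helpers of my own, and the sketch assumes that every node has the shape "(TAG child ...)" and that no token contains whitespace or parentheses. Each node comes back as [ TAG, child, child, ... ], where a child is either a word or another node.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Data::Dumper;

# Parse the inside of one node. The caller has already consumed the
# opening "(", so the next token must be the node's tag.
sub parse_node {
    my ($text_ref) = @_;
    $$text_ref =~ /\G\s*([^\s()]+)/gc
        or die "expected a tag at offset " . ( pos($$text_ref) // 0 ) . "\n";
    my @node = ($1);
    while (1) {
        if ( $$text_ref =~ /\G\s*\)/gc ) {              # end of this node
            return \@node;
        }
        elsif ( $$text_ref =~ /\G\s*\(/gc ) {           # nested node
            push @node, parse_node($text_ref);
        }
        elsif ( $$text_ref =~ /\G\s*([^\s()]+)/gc ) {   # plain word
            push @node, $1;
        }
        else {
            die "unbalanced parentheses at offset "
                . ( pos($$text_ref) // 0 ) . "\n";
        }
    }
}

# Entry point: expects the whole string to begin with one bracketed tree.
sub parse_tree {
    my ($text) = @_;
    pos($text) = 0;
    $text =~ /\G\s*\(/gc or die "tree must start with '('\n";
    return parse_node( \$text );
}

my $tree = parse_tree(
    "(SBARQ (WHNP (WP What))(SQ (VBZ is)(NP (NNP Head)(NNP Start)))(. ?))");
print Dumper($tree);
```

The trade-off is that you lose the grammar as a readable spec, and anything after the first complete tree is silently ignored.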

I also noticed from the P::RD man page that you can pass a reference to a scalar containing the string to be parsed. Portions of the string will be removed as the parser works through it, so if you get back less of a structure than you expect, you can look at the string to see where the parsing stopped (due to failure to match any rules). Here's my version of your code:

#!/usr/bin/perl

use strict;
use warnings;
use Parse::RecDescent;
use Data::Dumper;

$::RD_AUTOACTION = q { [@item] };
# $::RD_HINT = 1;

my $grammar= q {
start: tree
tree: '(' treestr(s) ')'
treestr: tree | tagstr
tagstr: TAG ( tree | word )
TAG: /[A-Z.]+ /
word: /[\w?!.]+/
};
my $parser=Parse::RecDescent->new($grammar);
my $text = "(SBARQ (WHNP (WP What))(SQ (VBZ is)(NP (NNP Head)(NNP Start)))(. ?))";
my $result = $parser->start( \$text );
print $text, "\n";
print Dumper($result);

(UPDATE: I have the "HINT" setting commented out because it wasn't all that helpful.)

Another update: you probably would have figured this out, but the ". ?" string really should be treated as a "TAG word" pair, which is what my version of the grammar does. The "." is a generic "TAG" label for (strings of?) punctuation, and the "?" in this case represents the actual token that occurred in the text. Other sentences, ending with other punctuation marks, would have ". ." or ". !", etc. The rule for TAG also absorbs the space that must follow the TAG token.

Yet another update: I added "!" and "." to the character class in the "word" rule of the full script. You might need to add more punctuation once you start getting into more varied sentences.
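A quick way to find out what else needs adding is to try the two terminal patterns against sample tokens before running the whole parser. This is a rough sketch: the sample tags and words are my own guesses at what turns up in other treebank sentences, and anchoring each pattern to the whole token is a simplification of how the parser applies them.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# The TAG and word patterns from the script, anchored so that each
# sample has to match in full (TAG includes its trailing space).
my $TAG  = qr/\A[A-Z.]+ \z/;
my $WORD = qr/\A[\w?!.]+\z/;

# Sample tokens (assumed, not taken from the OP's data).
my @tags  = ( 'NNP ', '. ', 'PRP$ ', ', ', '-LRB- ' );
my @words = ( 'Start', '?', "n't", 'well-known', ',' );

printf "TAG  %-14s %s\n", "'$_'", ( $_ =~ $TAG  ? 'ok' : 'NOT matched' ) for @tags;
printf "word %-14s %s\n", "'$_'", ( $_ =~ $WORD ? 'ok' : 'NOT matched' ) for @words;
```

Anything reported as not matched is a candidate for widening the corresponding character class in the grammar.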
