jcwren has asked for the wisdom of the Perl Monks concerning the following question:

I've been playing around with Parse::RecDescent, and have encountered something that I'm unclear on what the correct behavior should be.

The code fragment below represents a portion of a grammar for an assembler. There are two instructions, RET and RETI. If RET is specified before RETI in the grammar, RETI is not recognized by the parser. However, reversing them solves the problem, causing both to be recognized.

As I understand Parse::RecDescent, the specification of the grammar should not be order dependent. I believe this for several reasons. 1) I never saw it in the documentation (though I could have missed it); 2) One of the whole points of a recursive descent parser is cases just like this; 3) If one has a machine generated grammar, it could be very difficult if not impossible to determine a prerequired order of the grammar statements.

My question is: Am I mis-understanding a requirement of the grammar specification, or have I run across a bug in Parse::RecDescent (the former more like, the latter less so. After all, it's Damianware!)
#!/usr/local/bin/perl -ws use strict; use Parse::RecDescent; { $| = 1; my $grammar; { local $/; $grammar = <DATA>; } my $parser = Parse::RecDescent->new ($grammar); print ">"; while (<>) { print $parser->main ($_), ">" || die "Bad Code"; } } __DATA__ main: /^\s*\Z/ | input_line EOL | <error> EOL: /\s*\Z/ input_line: opcode_8031 # # The opcode code list. Reverse these two to fix, or RET/RETI orderi +ng to break. # opcode_8031: opcode_ret | opcode_reti opcode_ret: OP_RET opcode_reti: OP_RETI # # 8031 opcodes # OP_RET: 'ret' OP_RETI: 'reti'
--Chris

e-mail jcwren

Replies are listed 'Best First'.
Re: Parse::RecDescent and grammar specification
by chromatic (Archbishop) on Nov 05, 2000 at 08:43 UTC
    A recursive descent parser, as I understand it, doesn't match the longest token first, it matches the first possibility first.

    In this case, with simple lines composed of only one literal apiece, you'll either have to specify the longest token first, or rewrite the grammar slightly:

    main: /^\s*\Z/ | input_line | <error> EOL: /\s*\Z/ input_line: opcode_8031 | <error> # # The opcode code list. Reverse these two to fix, or RET/RETI orderi +ng to break. # opcode_8031: opcode_ret EOL | opcode_reti EOL opcode_ret: OP_RET { print "Got ret!\n" } opcode_reti: OP_RETI { print "Got reti!\n" } # # 8031 opcodes # OP_RET: 'ret' OP_RETI: 'reti'
    The important change is that EOL moves from input_line to the opcode_8031 production. The code blocks in there are for debugging purposes, to make sure what I thought would match does.

    Half the trick in a grammar is figuring out where to put the blank spaces. Especially the ones that aren't really blank.

Re: Parse::RecDescent and grammar specification
by metaperl (Curate) on Nov 05, 2000 at 17:36 UTC
    Chromatic answered: In this case, with simple lines composed of only one literal apiece, you'll either have to specify the longest token first, or rewrite the grammar slightly:

    This is not entirely true. If you look at the docs for Parse::RecDescent you will see that what he says is that the "best" match is chosen, and this will be the first successful match in an alternation unless you decide to use a scored grammar in which case you suffix each rule with a score directive and all alternatives are tried and the one with the best score wins.

    So, in his case, he could have a simple score directive like so:

    opcode: /match_text/ <score: { length join '' @item}>
    and append such lines in any order for opcode and instead of the alternatives being chosen based on order within alternation, they would be chosen based on how long a string they produced, @item being a list of the things with matched for this production.

    Just look for the section "Scored productions" in the .pod documentation.

RE: Parse::RecDescent and grammar specification
by Anonymous Monk on Nov 06, 2000 at 14:51 UTC
    Hi
    This is a question rather than a reply. I'm wanting to write a prog to convert some (straightforward) assembly code into the compiled byte strings - would using this Parse::RecDescent be something which would make my task easier or is it not really applicable ? I can't make my mind up from reading the docs.

    Thanks