gmarler has asked for the wisdom of the Perl Monks concerning the following question:

Finally spending time working on a Parse::RecDescent grammar, but am having trouble with low level productions that are very similar, so the wrong one is often picked, causing the parse to fail. Reading the docs for the module indicates that maybe using <score: ...> might be useful, but I'm not clear on how I could take advantage of it.

The issue seems to be that the statements I'm trying to parse take one of three forms:

Here's example code I've got, with __DATA__ at the end:

use strict;
use warnings;

use Parse::RecDescent;
use Data::Dumper;

my $grammar = <<'EOG';

<autotree>

VCSConfig: statement(s)

statement: clause | def

clause: "include" pathname   # Include Clause

# NOTE: May not have any attributes...
def: "cluster" name "(" Attr(s?) ")"
   | "system" name "(" Attr(s?) ")"

# Pathname may or may not be surrounded by double quotes
pathname: dquote(?) /([^"]+)/ dquote(?)
          { $return = $1; }

dquote: /"/

name: /\w+/

Attr: AttrScalar(s?) | AttrKeyList(s?) | AttrAssociation(s?)

AttrScalar: attribute '=' string

AttrKeyList: attribute '=' keylist

AttrAssociation: attribute '=' association

attribute: /[a-zA-Z][\w@]+/   # allow '@' in attr name

# NOTE: separator can be either of ',' or ';'
keylist: '{' <leftop: string /[,;]/ string> '}'

association: '{' <leftop: key_value /[,;]/ key_value> '}'

key_value: string '=' string

string: /[a-zA-Z]\w+/

EOG

my ($vcs_parse) = Parse::RecDescent->new( $grammar );

my ($vcs_config) = do { local $/; <DATA>; };

my ($orig_config) = $vcs_parse->VCSConfig( $vcs_config );

print Dumper $orig_config;

__DATA__
include "types.cf"
include "LBSybase.cf"
include "OracleTypes.cf"

cluster vcs (
    UserNames = { vcs = X1Nh6WIWs6ATQ }
    Administrators = { vcs }
    CounterInterval = 5
    )

system njengsunvcs1 (
    )

system njengsunvcs2 (
    )

The include clauses are parsed with no problem, but as soon as I hit the cluster clause, everything starts to break down, because I can't figure out how to get the grammar to properly differentiate between the Association, KeyList, and Scalar assignments within that clause.

Would the <score: ...> directive help me here? Or is there a much simpler way to get the grammar in line?

Note that this is just a small snippet of the config file in my example - the actual file I'm trying to parse is hundreds of lines long and has several other clause types, but they all have the same attribute types I'm trying to parse here - so this isn't really an easy regex problem either.

Re: Parse::RecDescent Grammar Questions
by ikegami (Patriarch) on May 20, 2008 at 20:45 UTC
    • I can't figure out how to get the grammar to properly differentiate between the Association, KeyList, and Scalar assignments within that clause.

      You shouldn't have to. The parser will try them all until it finds one that succeeds. The problem is that you wrote it so the first production of Attr (AttrScalar(s?)) will always succeed, so it'll never get to try the 2nd and 3rd productions (AttrKeyList(s?) and AttrAssociation(s?)).

      Instead of telling the parser to search for
      0 or more of {one of {0 or more AttrScalar} or {0 or more AttrKeyList} or {0 or more AttrAssociation}}
      you should be asking for
      0 or more of {one of AttrScalar or AttrKeyList or AttrAssociation}

      In other words,
      Attr: AttrScalar(s?) | AttrKeyList(s?) | AttrAssociation(s?)
      should be
      Attr: AttrScalar | AttrKeyList | AttrAssociation

    • Your second problem is that "5" in "CounterInterval = 5" doesn't match "string".

    • You never check if you've reached the end of the string. That's why it returned a parse tree even though it was incomplete.
      VCSConfig: statement(s)
      should be
      VCSConfig: statement(s) /\Z/

    • Are you sure that identifiers can't be one character long?
      /[a-zA-Z][\w@]+/
      should be
      /[a-zA-Z][\w@]*/
      and
      /[a-zA-Z]\w+/
      should be
      /[a-zA-Z]\w*/

    • It's bad to separate a token into multiple rules. It causes characters to be removed. (See <skip>.)
      pathname:  dquote(?) /([^"]+)/ dquote(?) { $item[1] }
      should be
      pathname: /"([^"]+)"/ { dequote($item[1]) }
              | /([^"]+)/ { $item[1] }

    • Attr: AttrScalar | AttrKeyList | AttrAssociation
      is *very* inefficient because all three subrules start with "attribute '='".

    • "cluster" name
      will see the following string as valid
      clusterpeanut
      You normally want to force a space in there. One way is to match any identifier, then require the identifier to be "cluster".
      This problem occurs in a few other places too.

    • It's very useful to uppercase tokens and keep them separate. They look similar to other rules, but you'll find that you'll be treating them a little special.

    • I find it much more readable to line up the : and the | of all the rules.
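
    A few of the points above can be checked with a small script. This is only a sketch: the toy grammar (rules greedy, fixed, word and number) is hypothetical and stands in for the Attr rules, and the last regex merely emulates what the parser does for "cluster" name when whitespace between tokens is optional.

    ```perl
    use strict;
    use warnings;
    use Parse::RecDescent;

    # Hypothetical toy grammar: "word" and "number" stand in for the Attr subrules.
    my $parser = Parse::RecDescent->new(q{
        greedy : word(s?) | number(s?)
        fixed  : word | number
        word   : /[a-z]+/
        number : /\d+/
    }) or die "Bad grammar\n";

    # word(s?) matches zero words and succeeds, so number(s?) is never tried.
    my $greedy = $parser->greedy('42');
    # Without the repetition, "word" fails and the parser moves on to "number".
    my $fixed  = $parser->fixed('42');

    printf "greedy matched %d item(s)\n", scalar @$greedy;
    printf "fixed matched '%s'\n", $fixed;

    # "+" requires at least two characters in total; "*" accepts a single letter.
    my $with_plus = 'x' =~ /\A[a-zA-Z]\w+\z/ ? 'yes' : 'no';
    my $with_star = 'x' =~ /\A[a-zA-Z]\w*\z/ ? 'yes' : 'no';
    print "x matches with +: $with_plus, with *: $with_star\n";

    # Rough emulation of '"cluster" name': the space between the keyword and
    # the name is optional, so "clusterpeanut" is accepted.
    my ($name) = 'clusterpeanut' =~ /\Acluster\s*(\w+)/;
    print "name after 'cluster': $name\n";
    ```

    The first production of greedy returns an empty list rather than failing, which is why the original Attr rule never reached its second and third productions.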

    make_parser.pl generates the parser. Run it to create VCSConfigParser.pm.

    use strict;
    use warnings;

    use Parse::RecDescent qw( );

    my $grammar = <<'EOG';

       <autotree>

       {
          # These affect the entire parser.
          use strict;
          use warnings;

          sub dequote {
             my $s = $_[0];
             $s =~ s/^"//;
             $s =~ s/"\z//;
             return $s;
          }
       }

       parse      : stmt(s) /\Z/

       stmt       : clause
                  | def

       clause     : "include" pathname

       # Pathname may or may not be surrounded by double quotes
       pathname   : STRING
                  | BAREWORD

       def        : IDENT def_[ $item[1] ] { $item[2] }

       # An action fails only when it returns undef (0 is defined, so it would succeed).
       def_       : { $arg[0] eq "cluster" ?1:undef } IDENT "(" attr(s?) ")"
                  | { $arg[0] eq "system"  ?1:undef } IDENT "(" attr(s?) ")"

       attr       : ATTRNAME '=' attr_val

       attr_val   : ident
                  | string
                  | number
                  | key_list
                  | assoc_list

       val        : ident
                  | string
                  | number

       # These aren't inlined because of <autotree>
       ident      : IDENT
       string     : STRING
       number     : NUMBER

       key_list   : '{' <leftop: IDENT /[,;]/ IDENT> '}'

       assoc_list : '{' <leftop: key_value /[,;]/ key_value> '}'

       key_value  : IDENT '=' val

       # === Tokens ===

       IDENT      : /[a-zA-Z]\w*/    { $item[1] }
       ATTRNAME   : /[a-zA-Z][\w@]*/ { $item[1] }
       STRING     : /"(?:[^"]+)"/    { dequote($item[1]) }
       NUMBER     : /\d+/            { $item[1] }

       # Need work.
       BAREWORD   : /(?:[^"]+)/      { $item[1] }

    EOG

    Parse::RecDescent->Precompile($grammar, 'VCSConfigParser')
       or die("Bad grammar\n");

    test.pl, a sample program that uses the parser.

    use strict;
    use warnings;

    use VCSConfigParser qw( );
    use Data::Dumper    qw( Dumper );

    #$::RD_TRACE = '';

    my $vcs_parser = VCSConfigParser->new();
    my $vcs_config = do { local $/; <DATA> };
    my $tree = $vcs_parser->parse( $vcs_config );
    print Dumper $tree;

    __DATA__
    include "types.cf"
    include "LBSybase.cf"
    include "OracleTypes.cf"

    cluster vcs (
        UserNames = { vcs = X1Nh6WIWs6ATQ }
        Administrators = { vcs }
        CounterInterval = abc
        )

    system njengsunvcs1 (
        )

    system njengsunvcs2 (
        )

    Notes:

    • def + def_ is an optimization of

      def : IDENT { $item[1] eq "cluster" ?1:undef } IDENT "(" attr(s?) ")"
          | IDENT { $item[1] eq "system"  ?1:undef } IDENT "(" attr(s?) ")"

      By eliminating the common prefix of the productions, the parser is sped up.

    • If you want to allow nesting, change val to attr_val.

      Thanks very much - that's exactly what I was looking for. And very nice tips to boot.

      Still trying to wrap my head around exactly how this does its job:

      def        : IDENT def_[ $item[1] ]

      But I'll run it through with the -RD_TRACE flag a few times to figure it out...

      Now to build a sensible data structure out of all this parsed data - gotta think about this a bit.

        It passes an argument ($item[1], which contains what IDENT matched) to def_, which can access it via @arg.
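
        A stripped-down sketch of that mechanism, assuming a hypothetical two-rule grammar rather than the full VCS one: def matches an IDENT, hands it to def_ in square brackets, and def_ reads it from @arg. The leading action in each production acts as a guard, since an action fails only when it returns undef.

        ```perl
        use strict;
        use warnings;
        use Parse::RecDescent;

        # Hypothetical miniature of the def/def_ pair; the return strings are
        # made up for illustration.
        my $parser = Parse::RecDescent->new(q{
            def   : IDENT def_[ $item[1] ]  { $return = $item[2] }
            def_  : { $arg[0] eq 'cluster' ? 1 : undef } IDENT { $return = "cluster:$item[2]" }
                  | { $arg[0] eq 'system'  ? 1 : undef } IDENT { $return = "system:$item[2]" }
            IDENT : /[a-zA-Z]\w*/
        }) or die "Bad grammar\n";

        my $cluster = $parser->def('cluster vcs');
        my $system  = $parser->def('system njengsunvcs1');
        my $other   = $parser->def('peanut butter');   # neither guard passes

        print "cluster input: $cluster\n";
        print "system input:  $system\n";
        print "other input:   ", (defined $other ? $other : 'no match'), "\n";
        ```

        Inside def_, $item[1] is the guard's value and $item[2] is the second IDENT, which is why the actions use $item[2].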
Re: Parse::RecDescent Grammar Questions
by pc88mxer (Vicar) on May 20, 2008 at 20:17 UTC
    If the problem is disambiguating AttrKeyList and AttrAssociation, how about simplifying the grammar by eliminating the AttrKeyList rule and augmenting key_value to include just a plain string:
    key_value: string '=' string | string
    This will allow things like:
    UserNames = { vcs = X1Nh6WIWs6ATQ, foo }
    which wasn't legal in your original grammar. However, you can always check for a proper AttrKeyList or AttrAssociation after you've parsed the input.
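
    A post-parse check along those lines might look like the following sketch. The data shape is an assumption, not something the grammar above produces as written: each list entry is taken to be an array ref holding either [key] or [key, value], and classify_list is a hypothetical helper name.

    ```perl
    use strict;
    use warnings;

    # Hypothetical helper: decide whether a parsed brace list is a pure key
    # list, a pure association, or an illegal mixture of the two.
    sub classify_list {
        my ($entries) = @_;
        return 'empty' unless @$entries;
        my $pairs = grep { @$_ == 2 } @$entries;   # count of key=value entries
        return 'association' if $pairs == @$entries;
        return 'keylist'     if $pairs == 0;
        return 'mixed';
    }

    my $assoc = classify_list([ [ 'vcs', 'X1Nh6WIWs6ATQ' ] ]);
    my $keys  = classify_list([ [ 'vcs' ] ]);
    my $mixed = classify_list([ [ 'vcs', 'X1Nh6WIWs6ATQ' ], [ 'foo' ] ]);

    print "$assoc $keys $mixed\n";
    ```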

      If the problem is disambiguating AttrKeyList and AttrAssociation

      It's not. They are already unambiguous.

Re: Parse::RecDescent Grammar Questions
by psini (Deacon) on May 20, 2008 at 20:21 UTC

    Second try :)

    Seems to me that these rules:

    keylist: '{' <leftop: string /[,;]/ string> '}'
    association: '{' <leftop: key_value /[,;]/ key_value> '}'

    will match only keylists and associations of length==2. Shouldn't they be:

    keylist: '{' <leftop: string (/[,;]/ string>)(s?) '}'
    association: '{' <leftop: key_value (/[,;]/ key_value>)(s?) '}'

    to match groups of any length?

    Rule One: Do not act incautiously when confronting a little bald wrinkly smiling man.

      No.

      <leftop: key_value /[,;]/ key_value>

      is equivalent to

      ( key_value (/[,;]/ key_value)(s?) { [ $item[1], @{$item[2]} ] } )

      The "(s?)" is already built in.
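
      This is easy to confirm with a small test, sketched here with a hypothetical one-rule grammar: the same <leftop> accepts lists of one, two, or four items. The grep keeps only identifier-like elements, so the count holds whether or not separators appear in the returned list.

      ```perl
      use strict;
      use warnings;
      use Parse::RecDescent;

      # Hypothetical miniature of the keylist rule.
      my $parser = Parse::RecDescent->new(q{
          keylist : '{' <leftop: IDENT /[,;]/ IDENT> '}' { $return = $item[2] }
          IDENT   : /[a-zA-Z]\w*/
      }) or die "Bad grammar\n";

      my %keys_for;
      for my $input ('{ a }', '{ a, b }', '{ a, b; c, d }') {
          my $list = $parser->keylist($input)
              or die "no match for $input\n";
          # Keep only the identifiers from the returned array ref.
          $keys_for{$input} = [ grep { /\A\w+\z/ } @$list ];
          printf "%-16s has %d key(s)\n", $input, scalar @{ $keys_for{$input} };
      }
      ```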

Re: Parse::RecDescent Grammar Questions
by psini (Deacon) on May 20, 2008 at 20:11 UTC

    I've never used Parse::RecDescent, but I think this line:

    association: '{' <leftop: key_value /[,;]/ key_value> '}'

    should be

    association: '{' <leftop: key_value '=' key_value> '}'

    Update: No, sorry, I didn't understand the context :(.

    Rule One: Do not act incautiously when confronting a little bald wrinkly smiling man.

Re: Parse::RecDescent Grammar Questions
by pc88mxer (Vicar) on May 20, 2008 at 20:54 UTC
    I think I understand what your question is. Does it have to do with this set of rules:
    def: "cluster" name "(" Attr(s?) ")"
       | "system" name "(" Attr(s?) ")"
    Attr: AttrScalar(s?) | AttrKeyList(s?) | AttrAssociation(s?)
    and do you really want the following:
    def: "cluster" name "(" Attr(s?) ")"
       | "system" name "(" Attr(s?) ")"
    Attr: AttrScalar | AttrKeyList | AttrAssociation
    Otherwise I wouldn't be surprised if the parser sees a reduce-reduce conflict.
      Isn't the expression "reduce-reduce conflict" exclusive to LR parsers? P::RD is an LL parser.
        Yeah, you're probably right. In this case Parse::RecDescent probably resolves it by choosing one without informing you that there are other options - just my guess, though.