Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

I am having a problem defining a grammar for Parse::RecDescent. The format I am parsing is relatively simple: a text file with opening tags of the form <MT\d+> (for example, <MT100> or <MT57>). These tags may be closed by one of two events:

There may or may not be any information between the opening and closing of a tag, but the presence of such tags is still important.

So far I have been able to build the following grammar for Parse::RecDescent:

File: Tag(s) EOF

Tag:
    | MT Element EOT
      { print "MT: ", $item[1], "\n"; print " ", $item[2], "\n"; }
    | MT Element
      { print "MT: ", $item[1], "\n"; print " ", $item[2], "\n"; }
    | MT EOT
      { print "MT: ", $item[1], "\n"; }
    | MT
      { print "MT: ", $item[1], "\n"; }
    | <error>

MT:      /<MT\d+>/
Element: /[^\/\n]+/
EOF:     /^\Z/
EOT:     /\//

But the problem is that the Element token is too greedy: it consumes everything up to the end of the line, swallowing any other tags on that line. I have also tried the ...TOKEN (lookahead) construct within the grammar, with no luck. What am I doing wrong?

Replies are listed 'Best First'.
Re: Difficulties defining a grammar for Parse::RecDescent
by Anonymous Monk on Jan 31, 2003 at 14:13 UTC
    Next time try something like this when posting a question (or debugging)
    #!/usr/bin/perl -lw
    use strict;
    use Parse::RecDescent;

    $::RD_TRACE  = 1; # dump a trace
    $::RD_ERRORS = 1; # Make sure the parser dies when it encounters an error
    $::RD_WARN   = 1; # Enable warnings. This will warn on unused rules &c.
    $::RD_HINT   = 1; # Give out hints to help fix problems.

    # universal token prefix pattern (default is: '\s*')
    warn "TOKEN SEPARATOR: $Parse::RecDescent::skip \n";

    my $grammar = <<'_EOGRAMMAR_';
    File: Tag(s) EOF

    Tag:
        | MT Element EOT
          { print "MT: ", $item[1], "\n"; print " ", $item[2], "\n"; }
        | MT Element
          { print "MT: ", $item[1], "\n"; print " ", $item[2], "\n"; }
        | MT EOT
          { print "MT: ", $item[1], "\n"; }
        | MT
          { print "MT: ", $item[1], "\n"; }
        | <error>

    MT:      /<MT\d+>/
    Element: /[^\/\n]+/
    EOF:     /^\Z/
    EOT:     /\//
    _EOGRAMMAR_
    ### THIS IS THE END OF THE GOSH DURN GRAMMAR

    # Create and compile the source file
    my $parser = Parse::RecDescent->new($grammar) or die "Dang!! $!";

    $parser->File(q[
    <MT100> This SUCKS </MT100>
    <MT666> WHY? <MT066> CAUSE IT DOESN'T WORK
    <MT2> SO What you gonna DO?/NOW
    I dunno
    ]);
    __END__
    TOKEN SEPARATOR: \s*
    Parse::RecDescent: Treating "File:" as a rule declaration
    Parse::RecDescent: Treating "Tag(s)" as a one-or-more subrule match
    Parse::RecDescent: Treating "EOF" as a subrule match
    Parse::RecDescent: Treating "Tag:" as a rule declaration
    Parse::RecDescent: Treating "|" as a new production
    Parse::RecDescent: Treating "MT" as a subrule match
    Parse::RecDescent: Treating "Element" as a subrule match
    Parse::RecDescent: Treating "EOT" as a subrule match
    Parse::RecDescent: Treating "{ print "MT: ", $item[1], "\n"; print " ", $item[2], "\n"; }" as an action
    Parse::RecDescent: Treating "|" as a new production
    Parse::RecDescent: Treating "MT" as a subrule match
    Parse::RecDescent: Treating "Element" as a subrule match
    Parse::RecDescent: Treating "{ print "MT: ", $item[1], "\n"; print " ", $item[2], "\n"; }" as an action
    Parse::RecDescent: Treating "|" as a new production
    Parse::RecDescent: Treating "MT" as a subrule match
    Parse::RecDescent: Treating "EOT" as a subrule match
    Parse::RecDescent: Treating "{ print "MT: ", $item[1], "\n"; }" as an action
    Parse::RecDescent: Treating "|" as a new production
    Parse::RecDescent: Treating "MT" as a subrule match
    Parse::RecDescent: Treating "{ print "MT: ", $item[1], "\n"; }" as an action
    Parse::RecDescent: Treating "| <error" as a new (error) production
    Parse::RecDescent: Treating "<error>" as an error marker
    Parse::RecDescent: Treating "MT:" as a rule declaration
    Parse::RecDescent: Treating "/<MT\d+>/" as a /../ pattern terminal
    Parse::RecDescent: Treating "Element:" as a rule declaration
    Parse::RecDescent: Treating "/[^\/\n]+/" as a /../ pattern terminal
    Parse::RecDescent: Treating "EOF:" as a rule declaration
    Parse::RecDescent: Treating "/^\Z/" as a /../ pattern terminal
    Parse::RecDescent: Treating "EOT:" as a rule declaration
    Parse::RecDescent: Treating "/\//" as a /../ pattern terminal
    | File     |Trying rule: [File]                   |
    | File     |                                      |"\n<MT100> This SUCKS
    |          |                                      |</MT100>\n<MT666> WHY?
    |          |                                      |<MT066> CAUSE IT DOESN'T
    |          |                                      |WORK\n<MT2> SO What you gonna
    |          |                                      |DO?/NOW\nI dunno\n"
    | File     |Trying production: [Tag EOF]          |
    | File     |Trying repeated subrule: [Tag]        |
    | Tag      |Trying rule: [Tag]                    |
    | Tag      |Trying production: []                 |
    | Tag      |>>Matched production: []<<            |
    | Tag      |>>Matched rule<< (return value: [Tag])|
    | Tag      |(consumed: [])                        |
    | File     |>>Matched repeated subrule: [Tag]<< (1|
    |          |times)                                |
    | File     |Trying subrule: [EOF]                 |
    | EOF      |Trying rule: [EOF]                    |
    | EOF      |Trying production: [/^\Z/]            |
    | EOF      |Trying terminal: [/^\Z/]              |
    | EOF      |<<Didn't match terminal>>             |
    | EOF      |                                      |"<MT100> This SUCKS
    |          |                                      |</MT100>\n<MT666> WHY?
    |          |                                      |<MT066> CAUSE IT DOESN'T
    |          |                                      |WORK\n<MT2> SO What you gonna
    |          |                                      |DO?/NOW\nI dunno\n"
    | EOF      |<<Didn't match rule>>                 |
    | File     |<<Didn't match subrule: [EOF]>>       |
    | File     |<<Didn't match rule>>                 |
    printing code (31240) to RD_TRACE
Re: Difficulties defining a grammar for Parse::RecDescent
by castaway (Parson) on Jan 31, 2003 at 14:17 UTC
    To make the Element regexp non-greedy, use:
    Element: /[^<\/\n]+?/
    So that it picks up everything until it encounters a newline, slash or <, instead of up to the last one of these. (It will be a problem if an element can contain a < or slash though.)
    And I think your <error> text should be in quotes.

    C.

Re: Difficulties defining a grammar for Parse::RecDescent
by Anonymous Monk on Jan 31, 2003 at 14:25 UTC
    What's the point of
    Tag: | MT Element EOT
    Why have an empty "production"?
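
The trace posted in this thread seems to bear this out: the empty production before the first "|" matches without consuming anything, so Tag(s) succeeds once on nothing and EOF then fails. A minimal sketch of a grammar without the empty production follows. It rests on assumptions the original post does not confirm: that a tag is closed either by "/" or by the start of the next tag, and that an element never contains "<", "/" or a newline. The sample input and the hash keys `tag` and `text` are made up for illustration.

```perl
use strict;
use warnings;
use Parse::RecDescent;

# No leading "|" after "Tag:", so there is no empty production.
# Element(?) and EOT(?) make the content and the closing slash optional,
# which covers the four alternatives of the original Tag rule.
my $grammar = <<'_EOGRAMMAR_';
File:    Tag(s) EOF
         { $return = $item[1] }
Tag:     MT Element(?) EOT(?)
         { $return = { tag => $item[1], text => $item[2][0] } }
MT:      /<MT\d+>/
Element: /[^<\/\n]+/
EOT:     /\//
EOF:     /\Z/
_EOGRAMMAR_

my $parser = Parse::RecDescent->new($grammar)
    or die "Bad grammar\n";

# Hypothetical input: one tag closed by "/", one empty tag, one closed by end of input.
my $text = "<MT100> first element/\n<MT57>\n<MT2> second\n";

my $tags = $parser->File($text)
    or die "Parse failed\n";

# Print each tag with its content, if any.
for my $t (@$tags) {
    printf "%-8s %s\n", $t->{tag}, $t->{text} // '(no content)';
}
```

Returning a data structure from the actions instead of printing inside them keeps the output independent of how many productions the parser tries along the way.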
Re: Difficulties defining a grammar for Parse::RecDescent
by Jaap (Curate) on Jan 31, 2003 at 14:09 UTC
    You might try to add a question mark for non-greediness of the + operator:
    Element: /[^\/\n]+?/
    Hmmm... this will probably not work without a terminator of some sort, depending on how Parse::RecDescent implements these regexps.
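
The concern about a missing terminator can be checked in plain Perl, without the parser. The sketch below assumes that Parse::RecDescent tries each terminal anchored at the current position in the text (that is my understanding of the generated code, but it is worth verifying); the `\A` anchor stands in for that here. The sample string and variable names are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical sample: element text followed by another tag on the same line.
my $text = "first element <MT57> second/";

# Original Element pattern: greedy, runs on to the "/" and swallows <MT57>.
my ($greedy) = $text =~ m{\A([^/\n]+)};

# Same pattern with +? and nothing after it: the shortest possible match
# is a single character, so this is too short rather than too long.
my ($lazy) = $text =~ m{\A([^/\n]+?)};

# Adding "<" to the excluded characters stops the greedy match at the next tag.
my ($stop_at_lt) = $text =~ m{\A([^</\n]+)};

print "greedy:     '$greedy'\n";
print "lazy:       '$lazy'\n";
print "stop_at_lt: '$stop_at_lt'\n";
```

On this input the greedy pattern captures everything up to the slash, the non-greedy one captures only "f", and the pattern that excludes "<" captures "first element " (with the trailing space). So it is the extra "<" in the character class, not the "?", that keeps Element from eating the next tag.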