Difficulties defining a grammar for Parse::RecDescent

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

I am having a problem defining a grammar for Parse::RecDescent. The format I am parsing is relatively simple: a text file with opening tags of the form <MT\d+> (for example, <MT100> or <MT57>). A tag may be closed by one of two events:

encountering a slash, /, or a newline, \n, or
encountering another opening tag

There may or may not be any information between the opening and closing of a tag, but the presence of such tags is still important.

So far I have been able to build the following grammar for Parse::RecDescent:

File:      Tag(s) EOF

Tag:
          | MT Element EOT
                { 
                  print "MT:  ", $item[1], "\n";
                  print "     ", $item[2], "\n";
                }
          | MT Element
                { 
                  print "MT:  ", $item[1], "\n";
                  print "     ", $item[2], "\n";
                }
          | MT EOT
                { 
                  print "MT:  ", $item[1], "\n"; 
                }
          | MT
                { 
                  print "MT:  ", $item[1], "\n"; 
                }
          | <error>

MT:         /<MT\d+>/

Element:    /[^\/\n]+/

EOF:        /^\Z/
EOT:        /\//

But the problem is that the Element token is too greedy and will eat up until the end of the line, swallowing any other tags. I have also tried to use the ...TOKEN construct within the grammar with no luck. What am I doing wrong?


Replies are listed 'Best First'.
Re: Difficulties defining a grammar for Parse::RecDescent by Anonymous Monk on Jan 31, 2003 at 14:13 UTC
Next time try something like this when posting a question (or debugging):

#!/usr/bin/perl -lw
use strict;
use Parse::RecDescent;

$::RD_TRACE  = 1; # dump a trace
$::RD_ERRORS = 1; # Make sure the parser dies when it encounters an error
$::RD_WARN   = 1; # Enable warnings. This will warn on unused rules &c.
$::RD_HINT   = 1; # Give out hints to help fix problems.

# universal token prefix pattern (default is: '\s')
warn "TOKEN SEPARATOR: $Parse::RecDescent::skip \n";

my $grammar = <<'_EOGRAMMAR_';
File:      Tag(s) EOF

Tag:
          | MT Element EOT
                {
                  print "MT:  ", $item[1], "\n";
                  print "     ", $item[2], "\n";
                }
          | MT Element
                {
                  print "MT:  ", $item[1], "\n";
                  print "     ", $item[2], "\n";
                }
          | MT EOT
                {
                  print "MT:  ", $item[1], "\n";
                }
          | MT
                {
                  print "MT:  ", $item[1], "\n";
                }
          | <error>

MT:         /<MT\d+>/

Element:    /[^\/\n]+/

EOF:        /^\Z/
EOT:        /\//
_EOGRAMMAR_
### THIS IS THE END OF THE GOSH DURN GRAMMAR

# Create and compile the source file
my $parser = Parse::RecDescent->new($grammar) or die "Dang!! $!";

$parser->File(q[
<MT100> This SUCKS </MT100>
<MT666> WHY? <MT066> CAUSE IT DOESN'T WORK
<MT2> SO What you gonna DO?/NOW
I dunno
]);

__END__
TOKEN SEPARATOR: \s
Parse::RecDescent: Treating "File:" as a rule declaration
Parse::RecDescent: Treating "Tag(s)" as a one-or-more subrule match
Parse::RecDescent: Treating "EOF" as a subrule match
Parse::RecDescent: Treating "Tag:" as a rule declaration
Parse::RecDescent: Treating "|" as a new production
Parse::RecDescent: Treating "MT" as a subrule match
Parse::RecDescent: Treating "Element" as a subrule match
Parse::RecDescent: Treating "EOT" as a subrule match
Parse::RecDescent: Treating "{ print "MT: ", $item[1], "\n"; print " ", $item[2], "\n"; }" as an action
Parse::RecDescent: Treating "|" as a new production
Parse::RecDescent: Treating "MT" as a subrule match
Parse::RecDescent: Treating "Element" as a subrule match
Parse::RecDescent: Treating "{ print "MT: ", $item[1], "\n"; print " ", $item[2], "\n"; }" as an action
Parse::RecDescent: Treating "|" as a new production
Parse::RecDescent: Treating "MT" as a subrule match
Parse::RecDescent: Treating "EOT" as a subrule match
Parse::RecDescent: Treating "{ print "MT: ", $item[1], "\n"; }" as an action
Parse::RecDescent: Treating "|" as a new production
Parse::RecDescent: Treating "MT" as a subrule match
Parse::RecDescent: Treating "{ print "MT: ", $item[1], "\n"; }" as an action
Parse::RecDescent: Treating "| <error" as a new (error) production
Parse::RecDescent: Treating "<error>" as an error marker
Parse::RecDescent: Treating "MT:" as a rule declaration
Parse::RecDescent: Treating "/<MT\d+>/" as a /../ pattern terminal
Parse::RecDescent: Treating "Element:" as a rule declaration
Parse::RecDescent: Treating "/[^\/\n]+/" as a /../ pattern terminal
Parse::RecDescent: Treating "EOF:" as a rule declaration
Parse::RecDescent: Treating "/^\Z/" as a /../ pattern terminal
Parse::RecDescent: Treating "EOT:" as a rule declaration
Parse::RecDescent: Treating "/\//" as a /../ pattern terminal
| File |Trying rule: [File]                   |
| File |                                      |"\n<MT100> This SUCKS
|      |                                      |</MT100>\n<MT666> WHY?
|      |                                      |<MT066> CAUSE IT DOESN'T
|      |                                      |WORK\n<MT2> SO What you gonna
|      |                                      |DO?/NOW\nI dunno\n"
| File |Trying production: [Tag EOF]          |
| File |Trying repeated subrule: [Tag]        |
| Tag  |Trying rule: [Tag]                    |
| Tag  |Trying production: []                 |
| Tag  |>>Matched production: []<<            |
| Tag  |>>Matched rule<< (return value: [Tag])|
| Tag  |(consumed: [])                        |
| File |>>Matched repeated subrule: [Tag]<< (1|
|      |times)                                |
| File |Trying subrule: [EOF]                 |
| EOF  |Trying rule: [EOF]                    |
| EOF  |Trying production: [/^\Z/]            |
| EOF  |Trying terminal: [/^\Z/]              |
| EOF  |<<Didn't match terminal>>             |
| EOF  |                                      |"<MT100> This SUCKS
|      |                                      |</MT100>\n<MT666> WHY?
|      |                                      |<MT066> CAUSE IT DOESN'T
|      |                                      |WORK\n<MT2> SO What you gonna
|      |                                      |DO?/NOW\nI dunno\n"
| EOF  |<<Didn't match rule>>                 |
| File |<<Didn't match subrule: [EOF]>>       |
| File |<<Didn't match rule>>                 |
printing code (31240) to RD_TRACE
Re: Difficulties defining a grammar for Parse::RecDescent by castaway (Parson) on Jan 31, 2003 at 14:17 UTC
To make the Element regexp non-greedy, use: `Element: /[^<\/\n]+?/` so that it picks up everything until it encounters a newline, slash or <, instead of up to the last one of these. (It will be a problem if an element can contain a < or slash, though.) And I think your `<error>` text should be in quotes. C.
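If an element may itself contain a bare <, one option is to keep a greedy quantifier and add a negative lookahead, so the match stops only at something that looks like an opening tag. The sketch below is plain Perl rather than a Parse::RecDescent grammar; the sample text and variable names are made up for illustration, and a slash or newline still ends the element.

```perl
use strict;
use warnings;

# One or more characters that are neither '/' nor newline, refusing to
# step onto the start of an opening tag such as <MT57>.
my $element = qr{(?:(?!<MT\d+>)[^/\n])+};

# Hypothetical sample: a bare '<' inside the element, then a real tag.
my $text = "a < b is fine<MT57>next";

my ($match) = $text =~ m{\A($element)};
print "[$match]\n";    # prints "[a < b is fine]"
```

The same pattern body could presumably be dropped into the Element terminal, at the cost of a less readable grammar.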
Re: Difficulties defining a grammar for Parse::RecDescent by Anonymous Monk on Jan 31, 2003 at 14:25 UTC
What's the point of `Tag: | MT Element EOT`? Why have an empty "production"?
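For comparison, here is a minimal sketch of the grammar with the empty first production removed, the four alternatives folded into optional subrules, and < excluded from Element. It has not been run against the poster's real data; the sample input and the returned hash layout are assumptions for illustration. With the default skip pattern a newline is skipped as whitespace rather than treated as a terminator, so strict newline handling would probably need a narrower skip.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Parse::RecDescent;

my $grammar = <<'_EOGRAMMAR_';
File:     Tag(s) EOF
              { $return = $item[1] }

Tag:      MT Element(?) EOT(?)
              { $return = { tag => $item[1], text => $item[2][0] } }

MT:       /<MT\d+>/
Element:  /[^<\/\n]+/
EOT:      /\//
EOF:      /\Z/
_EOGRAMMAR_

my $parser = Parse::RecDescent->new($grammar) or die "Bad grammar";

# Hypothetical sample: a slash, a following tag, and a newline each close a tag.
my $input = "<MT100>alpha/<MT57>beta<MT58>\n<MT59>gamma\n";

my $tags = $parser->File($input) or die "Parse failed";

# Each entry is a hash with the tag and its (possibly missing) text.
for my $t (@$tags) {
    my $text = defined $t->{text} ? $t->{text} : '(empty)';
    print "$t->{tag}: $text\n";
}
```

Returning data from the actions instead of printing inside them keeps empty tags visible to the caller, which matters here because their presence is significant.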
Re: Difficulties defining a grammar for Parse::RecDescent by Jaap (Curate) on Jan 31, 2003 at 14:09 UTC
You might try adding a question mark to make the + operator non-greedy: `Element: /[^\/\n]+?/` Hmm... this will probably not work without a terminator of some sort, depending on how Parse::RecDescent implements these regexps.
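The doubt about the missing terminator can be checked with plain Perl regexes, outside Parse::RecDescent. A minimal sketch, assuming the parser anchors each terminal at the current position in the text (the sample line is borrowed from the test input posted in this thread):

```perl
use strict;
use warnings;

my $text = "WHY? <MT066> CAUSE IT DOESN'T WORK\n";

# Original class: stops only at a slash or newline, so it runs through the tag.
my ($greedy) = $text =~ m{\A([^/\n]+)};

# Lazy quantifier with nothing after it: satisfied by a single character.
my ($lazy) = $text =~ m{\A([^/\n]+?)};

# Class that also excludes '<': stops just before the next opening tag.
my ($bounded) = $text =~ m{\A([^</\n]+)};

print "greedy:  [$greedy]\n";     # [WHY? <MT066> CAUSE IT DOESN'T WORK]
print "lazy:    [$lazy]\n";       # [W]
print "bounded: [$bounded]\n";    # [WHY? ]
```

With nothing following it in the pattern, the lazy +? settles for one character, so in this sketch it is the exclusion of < from the class, not the question mark, that keeps the match from running into the next tag.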