
Your Marpa grammar has a problem. Marpa's Scanless Interface uses a two-level grammar: a simple and efficient grammar for the lexer which breaks up the source into tokens for the more versatile high-level grammar. Your grammar currently only uses the high-level interface, which leads to an incredible amount of ambiguity.

Rules in the lexer grammar are not declared with “::=” but with “~”. A rule declared in this way can either be used as a “terminal symbol” in the high-level grammar, or in another low-level rule. Let's rewrite your grammar accordingly. As a naming convention, I used CamelCase for rules in the high-level grammar, ALL_UPPERCASE for terminal symbols in the high-level grammar, and snake_case for other rules in the low-level grammar.

inaccessible is fatal by default
lexeme default = latm => 1

:start       ::= PlistFile
PlistFile    ::= VersionData Ows GlobalPlists
VersionData  ::= 'Version' WS FLOAT Ows ';'
GlobalPlists ::= GlobalPlist+
GlobalPlist  ::= GLOBAL_PL_DECLARE WS PL_NAME Ows OptPlOptions Ows '{' Ows OptEmbeddedBase Ows Nodes '}' Ows

OptPlOptions ::= Option*
Option       ::= '[' Ows OPTION_DATA Ows ']' Ows

OptEmbeddedBase ::= EmbeddedBase*
EmbeddedBase    ::= '#' Ows 'base' Ows '=' Ows BaseNumbers
BaseNumbers     ::= BASE_NUMBER+ separator => COMMA

Nodes ::= Node+
Node  ::= Pattern | COMMENT | GlobalPlist | ReferencePlist

Pattern      ::= PAT_DECLARE WS PAT_NAME Ows OptPatOption ';' Ows OptTagStr Ows
OptPatOption ::= Option
OptPatOption ::=
OptTagStr    ::= TagStr
OptTagStr    ::=
TagStr       ::= '#' Ows TagList Ows '#'
TagList      ::= TAG* separator => COMMA

ReferencePlist ::= 'PList' WS RefPlName Ows ';' Ows
RefPlName      ::= OptRefFile PL_NAME
OptRefFile     ::= RefFile*
RefFile        ::= FILE_NAME ':'

Ows  ::= WS # a lexeme cannot have zero length,
Ows  ::=    # so optional whitespace must be a high-level grammar feature
WS                ~ ws
COMMA             ~ ','
COMMENT           ~ '#' comment_chars [\n]
                  | '#' comment_chars [\n] ws
FLOAT             ~ int | int '.' int
BASE_NUMBER       ~ int
PL_NAME           ~ identifier
TAG               ~ identifier
PAT_NAME          ~ identifier
PAT_DECLARE       ~ 'Pat' | 'Pattern'
GLOBAL_PL_DECLARE ~ 'GlobalPList' | 'LocalPList' | 'PatternList'
FILE_NAME         ~ [\w.]+
OPTION_DATA       ~ [\w \.,]*

ws            ~ [\s]+
identifier    ~ [\w]+
comment_chars ~ [^\n]+
int           ~ [\d]+

Notice also that a “*” quantifier repeats a symbol zero or more times; it does not merely make it optional. If you want a rule to match at most once, add an empty production “Rule ::=” next to the non-empty one. Lexemes cannot have zero length, and must always consume characters.

Because this grammar is tidied up and pushes as much as possible into the more efficient low-level grammar, it should also use less memory. However, two problems remain:

There is some ambiguity between comments and base specifications or tag strings. This might lead to false parses, and is a source of inefficiency. You can reduce this by making “#” an illegal character inside comments.
You specify all whitespace manually. You could also make whitespace in the high-level rules implicit, via a :discard lexeme. This will increase efficiency as the high-level grammar has to deal with fewer symbols. While you can still require whitespace explicitly, whitespace would then be allowed between any rules.
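
Both suggestions can be sketched as follows. This is an untested fragment, not a drop-in replacement: it only shows two of the high-level rules with the explicit WS/Ows symbols removed, and it assumes nothing else in the grammar still refers to them.

```
# Whitespace is matched by the lexer and thrown away,
# so the high-level rules no longer have to mention it.
:discard ~ ws
ws       ~ [\s]+

# Two rules from above, rewritten without WS and Ows
VersionData    ::= 'Version' FLOAT ';'
ReferencePlist ::= 'PList' RefPlName ';'

# A comment body may no longer contain '#', so a tag string
# such as "# a, b #" cannot be read as a comment.
comment_chars ~ [^\n#]+
```

Note that excluding “#” only reduces the ambiguity: a base specification like “# base = 1,2” on a line of its own still looks like a comment to the lexer, so that case would need a different fix. Also, once whitespace is discarded, the grammar accepts input such as “Version1.0;”; keep an explicit WS in the places where a separator really is required.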

In reply to Re: Grammar based parsing methodology for multi MB strings/files (Marpa::R2/Regexp::Grammars) by amon
in thread Grammar based parsing methodology for multi MB strings/files (Marpa::R2/Regexp::Grammars) by tj_thompson
