in reply to Grammar based parsing methodology for multi MB strings/files (Marpa::R2/Regexp::Grammars)
Your Marpa grammar has a problem. Marpa's Scanless Interface uses a two-level grammar: a simple and efficient grammar for the lexer which breaks up the source into tokens for the more versatile high-level grammar. Your grammar currently only uses the high-level interface, which leads to an incredible amount of ambiguity.
Rules in the lexer grammar are not declared with “::=” but with “~”. A rule declared in this way can either be used as a “terminal symbol” in the high-level grammar, or in another low-level rule. Let's rewrite your grammar accordingly. As a naming convention, I used CamelCase for rules in the high-level grammar, ALL_UPPERCASE for terminal symbols in the high-level grammar, and snake_case for other rules in the low-level grammar.
    inaccessible is fatal by default
    lexeme default = latm => 1

    :start ::= PlistFile
    PlistFile ::= VersionData Ows GlobalPlists
    VersionData ::= 'Version' WS FLOAT Ows ';'
    GlobalPlists ::= GlobalPlist+
    GlobalPlist ::= GLOBAL_PL_DECLARE WS PL_NAME Ows OptPlOptions Ows '{' Ows OptEmbeddedBase Ows Nodes '}' Ows
    OptPlOptions ::= Option*
    Option ::= '[' Ows OPTION_DATA Ows ']' Ows
    OptEmbeddedBase ::= EmbeddedBase*
    EmbeddedBase ::= '#' Ows 'base' Ows '=' Ows BaseNumbers
    BaseNumbers ::= BASE_NUMBER+ separator => COMMA
    Nodes ::= Node+
    Node ::= Pattern | COMMENT | GlobalPlist || ReferencePlist
    Pattern ::= PAT_DECLARE WS PAT_NAME Ows OptPatOption ';' Ows OptTagStr Ows
    OptPatOption ::= Option
    OptPatOption ::=
    OptTagStr ::= TagStr
    OptTagStr ::=
    TagStr ::= '#' Ows TagList Ows '#'
    TagList ::= TAG* separator => COMMA
    ReferencePlist ::= 'PList' WS RefPlName Ows ';' Ows
    RefPlName ::= OptRefFile PL_NAME
    OptRefFile ::= RefFile*
    RefFile ::= FILE_NAME ':'
    Ows ::= WS  # a lexeme cannot have zero length,
    Ows ::=     # so optional whitespace must be a high-level grammar feature

    WS ~ ws
    COMMA ~ ','
    COMMENT ~ '#' comment_chars [\n] | '#' comment_chars [\n] ws
    FLOAT ~ int | int '.' int
    BASE_NUMBER ~ int
    PL_NAME ~ identifier
    TAG ~ identifier
    PAT_NAME ~ identifier
    PAT_DECLARE ~ 'Pat' | 'Pattern'
    GLOBAL_PL_DECLARE ~ 'GlobalPList' | 'LocalPList' | 'PatternList'
    FILE_NAME ~ [\w.]+
    OPTION_DATA ~ [\w \.,]*
    ws ~ [\s]+
    identifier ~ [\w]+
    comment_chars ~ [^\n]+
    int ~ [\d]+
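Notice also that a “*” quantifier repeats a rule, instead of making it optional. If you want to signal that a rule is optional, add an empty production for it, i.e. a second “Rule ::=” line with nothing on the right-hand side (as done for OptPatOption, OptTagStr and Ows). Lexemes cannot have zero length and must always consume characters, so optionality has to be expressed in the high-level grammar.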
Because this grammar is tidied up and pushes as much as possible into the more efficient low-level grammar, it should also use less memory. However, two problems remain:
There is some ambiguity between comments and base specifications or tag strings. This might lead to false parses, and is a source of inefficiency. You can reduce this by making “#” an illegal character inside comments.
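A minimal sketch of that change, assuming your comments never need to contain a literal “#” (I have not run this against your data). Only the comment_chars rule differs from the grammar above:

```
# Assumption: no comment in the input contains a '#' after the leading one.
COMMENT       ~ '#' comment_chars [\n] | '#' comment_chars [\n] ws
comment_chars ~ [^\n#]+    # was [^\n]+
```

With this, a tag string such as “# tag1, tag2 #” can no longer be lexed as a COMMENT, because the closing “#” ends the run of comment_chars before the newline is reached. A base specification that sits alone on its line still looks like a comment to the lexer, so this reduces the ambiguity rather than removing it.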
You specify all whitespace manually. You could also make whitespace in the high-level rules implicit, via a :discard lexeme. This will increase efficiency as the high-level grammar has to deal with fewer symbols. While you can still require whitespace explicitly, whitespace would then be allowed between any rules.
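As a rough sketch of the :discard variant, here are a few of the rules from the grammar above with Ows and WS removed. It assumes whitespace carries no meaning anywhere in the file; the remaining rules would be changed in the same way:

```
# Whitespace is matched by the lexer and thrown away,
# so the high-level grammar never sees it.
:discard ~ ws
ws       ~ [\s]+

# High-level rules no longer mention Ows or WS:
VersionData    ::= 'Version' FLOAT ';'
Option         ::= '[' OPTION_DATA ']'
EmbeddedBase   ::= '#' 'base' '=' BaseNumbers
ReferencePlist ::= 'PList' RefPlName ';'
```

The trade-off is that whitespace becomes optional everywhere, so the mandatory gap after 'Version' or 'PList' is no longer enforced by these rules. The empty-production Ows rules become unnecessary once every rule is converted.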