in reply to Grammar based parsing methodology for multi MB strings/files (Marpa::R2/Regexp::Grammars)

Your Marpa grammar has a problem. Marpa's Scanless Interface uses a two-level grammar: a simple and efficient grammar for the lexer which breaks up the source into tokens for the more versatile high-level grammar. Your grammar currently only uses the high-level interface, which leads to an incredible amount of ambiguity.

Rules in the lexer grammar are not declared with “::=” but with “~”. A rule declared in this way can either be used as a “terminal symbol” in the high-level grammar, or in another low-level rule. Let's rewrite your grammar accordingly. As a naming convention, I used CamelCase for rules in the high-level grammar, ALL_UPPERCASE for terminal symbols in the high-level grammar, and snake_case for other rules in the low-level grammar.

inaccessible is fatal by default lexeme default = latm => 1 :start ::= PlistFile PlistFile ::= VersionData Ows GlobalPlists VersionData ::= 'Version' WS FLOAT Ows ';' GlobalPlists ::= GlobalPlist+ GlobalPlist ::= GLOBAL_PL_DECLARE WS PL_NAME Ows OptPlOptions Ows '{' + Ows OptEmbeddedBase Ows Nodes '}' Ows OptPlOptions ::= Option* Option ::= '[' Ows OPTION_DATA Ows ']' Ows OptEmbeddedBase ::= EmbeddedBase* EmbeddedBase ::= '#' Ows 'base' Ows '=' Ows BaseNumbers BaseNumbers ::= BASE_NUMBER+ separator => COMMA Nodes ::= Node+ Node ::= Pattern | COMMENT | GlobalPlist || ReferencePlist Pattern ::= PAT_DECLARE WS PAT_NAME Ows OptPatOption ';' Ows OptT +agStr Ows OptPatOption ::= Option OptPatOption ::= OptTagStr ::= TagStr OptTagStr ::= TagStr ::= '#' Ows TagList Ows '#' TagList ::= TAG* separator => COMMA ReferencePlist ::= 'PList' WS RefPlName Ows ';' Ows RefPlName ::= OptRefFile PL_NAME OptRefFile ::= RefFile* RefFile ::= FILE_NAME ':' Ows ::= WS # a lexeme cannot have zero length, Ows ::= # so optional whitespace must be a high-level grammar feat +ure WS ~ ws COMMA ~ ',' COMMENT ~ '#' comment_chars [\n] | '#' comment_chars [\n] ws FLOAT ~ int | int '.' int BASE_NUMBER ~ int PL_NAME ~ identifier TAG ~ identifier PAT_NAME ~ identifier PAT_DECLARE ~ 'Pat' | 'Pattern' GLOBAL_PL_DECLARE ~ 'GlobalPList' | 'LocalPList' | 'PatternList' FILE_NAME ~ [\w.]+ OPTION_DATA ~ [\w \.,]* ws ~ [\s]+ identifier ~ [\w]+ comment_chars ~ [^\n]+ int ~ [\d]+

Notice also that a “*” quantifier repeats a rule, instead of making it optional. If you want to signal that a rule is optional, then add an empty production Rule ::=. Lexemes cannot have zero length, and must always consume characters.

As this grammar is tidied up and shoves as much as possible into the more efficient low-level grammar, it should also use less memory. However, two problems remain:

  • Comment on Re: Grammar based parsing methodology for multi MB strings/files (Marpa::R2/Regexp::Grammars)
  • Select or Download Code

Replies are listed 'Best First'.
Re^2: Grammar based parsing methodology for multi MB strings/files (Marpa::R2/Regexp::Grammars)
by tj_thompson (Monk) on May 06, 2014 at 23:21 UTC

    Your changes were certainly effective. The 2MB file I was parsing previously took 33+GB of memory to parse. Your updated changes require ~1GB.

    I also tried the :discard lexeme to remove the Ows references. This worked on the first attempt, so I was clearly doing something wrong the first time around. This reduced the memory usage further to ~500MB and sped up the parse time. It now sits at about 15s vs. the old parser's 9s.

    Many thanks for the insight. Now I get to figure out how to get the data back from the parse :)

Re^2: Grammar based parsing methodology for multi MB strings/files (Marpa::R2/Regexp::Grammars)
by tj_thompson (Monk) on May 06, 2014 at 22:39 UTC

    I had little doubt my Marpa grammar had issues. I'm still trying to wrap my head around lexer vs. parser and how Marpa's two levels interact. I figured the main difference was captures vs. non capture in my reading. I didn't realize there was a performance impact. I'm still having a very difficult time understanding the differences between the G0/L0 rules.

    I actually tried to use the :discard lexeme, but I found that I couldn't get the required newline after a partial line comment to match properly after this. They seemed to be discarded before they could match. Maybe I'll have another look.

    Thanks for the input, it's greatly appreciated :) I'll look over your changes and play with the code to see what I can learn.