Re: Grammar based parsing methodology for multi MB strings/files (Marpa::R2/Regexp::Grammars)

Your Marpa grammar has a problem. Marpa's Scanless Interface uses a two-level grammar: a simple and efficient grammar for the lexer which breaks up the source into tokens for the more versatile high-level grammar. Your grammar currently only uses the high-level interface, which leads to an incredible amount of ambiguity.

Rules in the lexer grammar are not declared with “::=” but with “~”. A rule declared in this way can either be used as a “terminal symbol” in the high-level grammar, or in another low-level rule. Let's rewrite your grammar accordingly. As a naming convention, I used CamelCase for rules in the high-level grammar, ALL_UPPERCASE for terminal symbols in the high-level grammar, and snake_case for other rules in the low-level grammar.

inaccessible is fatal by default
lexeme default = latm => 1

:start       ::= PlistFile
PlistFile    ::= VersionData Ows GlobalPlists
VersionData  ::= 'Version' WS FLOAT Ows ';'
GlobalPlists ::= GlobalPlist+
GlobalPlist  ::= GLOBAL_PL_DECLARE WS PL_NAME Ows OptPlOptions Ows '{'
+ Ows OptEmbeddedBase Ows Nodes '}' Ows

OptPlOptions ::= Option*
Option       ::= '[' Ows OPTION_DATA Ows ']' Ows

OptEmbeddedBase ::= EmbeddedBase*
EmbeddedBase    ::= '#' Ows 'base' Ows '=' Ows BaseNumbers
BaseNumbers     ::= BASE_NUMBER+ separator => COMMA

Nodes ::= Node+
Node  ::= Pattern | COMMENT | GlobalPlist || ReferencePlist

Pattern      ::= PAT_DECLARE WS PAT_NAME Ows OptPatOption ';' Ows OptT
+agStr Ows
OptPatOption ::= Option
OptPatOption ::=
OptTagStr    ::= TagStr
OptTagStr    ::=
TagStr       ::= '#' Ows TagList Ows '#'
TagList      ::= TAG* separator => COMMA

ReferencePlist ::= 'PList' WS RefPlName Ows ';' Ows
RefPlName      ::= OptRefFile PL_NAME
OptRefFile     ::= RefFile*
RefFile        ::= FILE_NAME ':'

Ows  ::= WS # a lexeme cannot have zero length,
Ows  ::=    # so optional whitespace must be a high-level grammar feat
+ure
WS                ~ ws
COMMA             ~ ','
COMMENT           ~ '#' comment_chars [\n]
                  | '#' comment_chars [\n] ws
FLOAT             ~ int | int '.' int
BASE_NUMBER       ~ int
PL_NAME           ~ identifier
TAG               ~ identifier
PAT_NAME          ~ identifier
PAT_DECLARE       ~ 'Pat' | 'Pattern'
GLOBAL_PL_DECLARE ~ 'GlobalPList' | 'LocalPList' | 'PatternList'
FILE_NAME         ~ [\w.]+
OPTION_DATA       ~ [\w \.,]*

ws            ~ [\s]+
identifier    ~ [\w]+
comment_chars ~ [^\n]+
int           ~ [\d]+
[download]

Notice also that a “*” quantifier repeats a rule, instead of making it optional. If you want to signal that a rule is optional, then add an empty production Rule ::=. Lexemes cannot have zero length, and must always consume characters.

As this grammar is tidied up and shoves as much as possible into the more efficient low-level grammar, it should also use less memory. However, two problems remain:

There is some ambiguity between comments and base specifications or tag strings. This might lead to false parses, and is a source of inefficiency. You can reduce this by making “#” an illegal character inside comments.
You specify all whitespace manually. You could also make whitespace in the high-level rules implicit, via a :discard lexeme. This will increase efficiency as the high-level grammar has to deal with fewer symbols. While you can still require whitespace explicitly, whitespace would then be allowed between any rules.

Comment on Re: Grammar based parsing methodology for multi MB strings/files (Marpa::R2/Regexp::Grammars) Select or Download Code

Replies are listed 'Best First'.
Re^2: Grammar based parsing methodology for multi MB strings/files (Marpa::R2/Regexp::Grammars) by tj_thompson (Monk) on May 06, 2014 at 23:21 UTC
Your changes were certainly effective. The 2MB file I was parsing previously took 33+GB of memory to parse. Your updated changes require ~1GB. I also tried the :discard lexeme to remove the Ows references. This worked on the first attempt, so I was clearly doing something wrong the first time around. This reduced the memory usage further to ~500MB and sped up the parse time. It now sits at about 15s vs. the old parser's 9s. Many thanks for the insight. Now I get to figure out how to get the data back from the parse :)	[reply]
Re^2: Grammar based parsing methodology for multi MB strings/files (Marpa::R2/Regexp::Grammars) by tj_thompson (Monk) on May 06, 2014 at 22:39 UTC
I had little doubt my Marpa grammar had issues. I'm still trying to wrap my head around lexer vs. parser and how Marpa's two levels interact. I figured the main difference was captures vs. non capture in my reading. I didn't realize there was a performance impact. I'm still having a very difficult time understanding the differences between the G0/L0 rules. I actually tried to use the :discard lexeme, but I found that I couldn't get the required newline after a partial line comment to match properly after this. They seemed to be discarded before they could match. Maybe I'll have another look. Thanks for the input, it's greatly appreciated :) I'll look over your changes and play with the code to see what I can learn.	[reply]