tj_thompson has asked for the wisdom of the Perl Monks concerning the following question:

Hello monks,

I'm attempting to parse a set of files ranging in size from a few kB to several MB (< 10MB). My previous solution parsed line by line, but I've bumped into a number of cases where newlines in unexpected places (or missing from expected places) break my parser even though they are legal syntax for the file format. Here's a short example of what a typical file looks like. Note that some of the comments carry necessary data, such as the 'base=111', '#KEEP#', and '#HOT,COLD#' comments. Also, please excuse the hard-to-read grammar; it was trial-and-error test code, tweaked until it worked.

Version 1.0;
GlobalPList plist1 [option1] [option2] {
    #base=111
    # this is a full line comment
    Pat n1000000g0000001; # this is a partial line comment
    Pat n2000000g0000002; #HOT#
    LocalPList plist2 {
        Pat n5000000g0000005;
        GlobalPList plist4 {
            Pat n8000000g0000008; #KEEP#
        }
    }
    Pat n3000000g0000003; #HOT,COLD#
    Pat n4000000g0000004;
    PList plist2;
    PList file1.plist:plist3;
}
GlobalPList plist2 [option3] {
    PList plist1;
    Pattern n6000000g0000006;
    Pattern n7000000g0000007;
}

I have written grammar-based parsers in both Regexp::Grammars and Marpa::R2, but both seem to require a huge amount of resources once the files exceed 750kB. I thought Marpa would perform better, but for the 2MB file it requires 33GB of memory to perform the parse. Here are the two grammars (and test scripts) in case you'd like to have a look:

Regexp::Grammars

use strict;
use warnings;

my $file = $ARGV[0];
my $s;
if (defined $file) {
    # parse data from file argument
    open my $fh, '<', $file or die "Failed to open file ($file):$!\n";
    # for processing speed, tossing out full line comments and empty lines now
    while (<$fh>) {
        next if /^\s*#.*$/ and !/^\s*#\s*base\s*=/;
        next if /^\s*$/;
        $s .= $_;
    }
    close $fh or die "Failed to close file ($file): $!\n";
}
else {
    # parse data from DATA
    while (<DATA>) {
        next if /^\s*#.*$/ and !/^\s*#\s*base\s*=/;
        next if /^\s*$/;
        $s .= $_;
    }
}

#print STDERR "DATA :\n$s\n";

use Regexp::Grammars;

my $parser = qr/
    # <logfile: - >
    # <debug: on>
    <nocontext:>
    <plist_file> <MATCH=(?{ $MATCH{plist_file} })>

    <token: float>                    \d+(\.\d+)?
    <token: global_plist_declare>     GlobalPList|LocalPList
    <token: reference_plist_declare>  PList
    <token: plist_name>               \w+
    <token: plist_option>             \[[\w \.,]*\]
    <token: base_number>              \d+
    <token: comment>                  \#[^\n]*
    <token: pat_declare>              Pat|Pattern
    <token: pat_name>                 \w+
    <token: tag>                      \w+
    <token: plist_open>               \{
    <token: plist_close>              \}
    <token: file_name>                [\w\.]+

    <rule: plist_file>      <version> <[global_plist]>+
    <rule: version>         Version <float> \; <MATCH=(?{ $MATCH{float} })>
    <token: embedded_base>  \#\s*base\s*=\s*<[base_number]>+ % ,\s*\n
                            <MATCH=(?{ $MATCH{base_number} })>
    <token: tags>           \#<[tag]>+ % ,\# <MATCH=(?{ $MATCH{tag} })>
    <rule: global_plist>    <.global_plist_declare> <plist_name>
                            <[plist_option]>* % \s*
                            <.plist_open>
                            <embedded_base>?
                            (<[data_node]> | <.comment>)+
                            <.plist_close>
    <rule: data_node>       <pattern> | <global_plist> | <reference_plist>
    <rule: reference_plist> <.reference_plist_declare>
                            ( <file_name> : <plist_name> | <plist_name> ) ;
    <rule: pattern>         <.pat_declare> <pat_name> ; <tags>?
/xms;

$s =~ $parser;

use Data::Dumper;
open my $out, '>', 'C:\tmp\tmp' or die "FAILED!\n";
print $out "RESULTS:" . Dumper(\%/);
print STDERR "RESULTS:" . Dumper(\%/);

__DATA__
Version 1.0;
# Plist SVN Url: $HeadURL: https://XXXXXX...
# Plist SVN Revision: $Id: gt.plist 2555 2014-02-06 16:16:26Z vsgatcha $
# RunDir: /XXXXXX...
GlobalPList plist1 [option1] [option2] {
    #base=111
    # this is a full line comment
    Pat n1000000g0000001; # this is a partial line comment
    Pat n2000000g0000002; #HOT#
    LocalPList plist2 {
        Pat n5000000g0000005;
        GlobalPList plist4 {
            Pat n8000000g0000008; #KEEP#
        }
    }
    Pat n3000000g0000003; #HOT,COLD#
    Pat n4000000g0000004;
    PList plist2;
    PList file1:plist3;
}
GlobalPList plist2 [option3] {
    PList plist1;
    Pattern n6000000g0000006;
    Pattern n7000000g0000007;
}

Marpa::R2

use strict;
use warnings;
use Data::Dumper;

my $file = $ARGV[0];
my $s;
if (defined $file) {
    # parse data from file if given
    open my $fh, '<', $file or die "Failed to open file ($file):$!\n";
    # for processing speed, tossing out full line comments and empty lines now
    while (<$fh>) {
        next if /^\s*#.*$/ and !/^\s*#\s*base\s*=/;
        next if /^\s*$/;
        $s .= $_;
    }
    close $fh or die "Failed to close file ($file): $!\n";
}
else {
    # parse data from DATA
    while (<DATA>) {
        next if /^\s*#.*$/ and !/^\s*#\s*base\s*=/;
        next if /^\s*$/;
        $s .= $_;
    }
}

#print STDERR "DATA :\n$s\n";

use Marpa::R2;

my $grammar_str = <<'END_GRAMMAR';
inaccessible is fatal by default
lexeme default = latm => 1

:start            ::= plist_file
plist_file        ::= version_data ows global_plists
version_data      ::= 'Version' mws float ows ';'
global_plists     ::= global_plist+
global_plist      ::= global_pl_declare mws pl_name ows opt_pl_options ows '{' ows opt_embedded_base ows nodes '}'
pl_name           ::= [\w]+
opt_pl_options    ::= option*
option            ::= '[' ows option_data ows ']' ows
nodes             ::= node+
node              ::= pattern | comment | global_plist || reference_plist
pattern           ::= pat_declare mws pat_name ows opt_pat_option ';' ows opt_tag_str ows
opt_pat_option    ::= option*
comment           ::= '#' comment_chars newline ows
reference_plist   ::= 'PList' mws ref_pl_name ows ';' ows
ref_pl_name       ::= opt_ref_file pl_name
opt_ref_file      ::= ref_file*
ref_file          ::= file_name ':'
file_name         ::= [\w\.]+
opt_tag_str       ::= tag_str*
tag_str           ::= '#' ows tag_list ows '#'
tag_list          ::= tag* separator => comma
opt_embedded_base ::= embedded_base*
embedded_base     ::= '#' ows 'base' ows '=' ows base_numbers
base_numbers      ::= base_number+ separator => comma
comma             ::= ','
pat_name          ::= [\w]+
pat_declare       ::= 'Pat'|'Pattern'
comment_chars     ::= [^\n]*
newline           ::= [\n]
int               ::= [\d]+
tag               ::= [\w]+
global_pl_declare ::= 'GlobalPList'|'LocalPList'|'PatternList'
option_data       ::= [\w \.,]*
base_number       ::= [\d]+
float             ::= int opt_fractional
opt_fractional    ::= fractional*
fractional        ::= '.' int

# optional whitespace
ows ::= [\s]*
# mandatory whitespace
mws ::= [\s]+
END_GRAMMAR

my $grammar = Marpa::R2::Scanless::G->new({
    source => \$grammar_str,
});
my $parser = Marpa::R2::Scanless::R->new({
    grammar => $grammar,
    # trace_values    => 2,
    # trace_terminals => 1,
});

eval { $parser->read( \$s ); };
if ($@) {
    print "PARSE ERROR:$@\n";
    die "EXITING\n";
    # die $parser->show_progress(0, -1);
}
else {
    print "SUCCESSFUL PARSE!\n";
    <STDIN>;
}
print STDERR Dumper( $parser->value );

__DATA__
Version 1.0;
# Plist SVN Url: $HeadURL: https://XXXXXX...
# Plist SVN Revision: $Id: gt.plist 2555 2014-02-06 16:16:26Z vsgatcha $
# RunDir: /XXXXXX...
GlobalPList plist1 [option1] [option2] {
    #base=111
    # this is a full line comment
    Pat n1000000g0000001; # this is a partial line comment
    Pat n2000000g0000002; #HOT#
    LocalPList plist2 {
        Pat n5000000g0000005;
        GlobalPList plist4 {
            Pat n8000000g0000008; #KEEP#
        }
    }
    Pat n3000000g0000003; #HOT,COLD#
    Pat n4000000g0000004;
    PList plist2;
    PList file1.plist:plist3;
}
GlobalPList plist2 [option3] {
    PList plist1;
    Pattern n6000000g0000006;
    Pattern n7000000g0000007;
}

So on to my actual question. Is this simply too large a piece of data to parse with a grammar-based parser? Are my grammars just horribly inefficient? Is there a better way to handle parsing this data? Note that the Marpa grammar doesn't actually capture any results yet.

Thanks for any insight ahead of time :) I sincerely appreciate the time spent looking over the problem.

Thomas


Re: Grammar based parsing methodology for multi MB strings/files (Marpa::R2/Regexp::Grammars)
by amon (Scribe) on May 06, 2014 at 21:23 UTC

    Your Marpa grammar has a problem. Marpa's Scanless Interface uses a two-level grammar: a simple and efficient grammar for the lexer which breaks up the source into tokens for the more versatile high-level grammar. Your grammar currently only uses the high-level interface, which leads to an incredible amount of ambiguity.

    Rules in the lexer grammar are not declared with “::=” but with “~”. A rule declared in this way can either be used as a “terminal symbol” in the high-level grammar, or in another low-level rule. Let's rewrite your grammar accordingly. As a naming convention, I used CamelCase for rules in the high-level grammar, ALL_UPPERCASE for terminal symbols in the high-level grammar, and snake_case for other rules in the low-level grammar.
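    As a minimal illustration of the split (the rule names here are invented for the example, not taken from your grammar), a structural rule is declared with “::=” and built from lexemes, which are themselves declared with “~”:

```
Statement ::= KEYWORD WS NAME ';'   # high-level (structural) rule
KEYWORD   ~ 'Pat' | 'Pattern'       # low-level (lexer) rules
WS        ~ [\s]+
NAME      ~ [\w]+
```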

    inaccessible is fatal by default
    lexeme default = latm => 1

    :start ::= PlistFile
    PlistFile       ::= VersionData Ows GlobalPlists
    VersionData     ::= 'Version' WS FLOAT Ows ';'
    GlobalPlists    ::= GlobalPlist+
    GlobalPlist     ::= GLOBAL_PL_DECLARE WS PL_NAME Ows OptPlOptions Ows '{' Ows OptEmbeddedBase Ows Nodes '}' Ows
    OptPlOptions    ::= Option*
    Option          ::= '[' Ows OPTION_DATA Ows ']' Ows
    OptEmbeddedBase ::= EmbeddedBase*
    EmbeddedBase    ::= '#' Ows 'base' Ows '=' Ows BaseNumbers
    BaseNumbers     ::= BASE_NUMBER+ separator => COMMA
    Nodes           ::= Node+
    Node            ::= Pattern | COMMENT | GlobalPlist || ReferencePlist
    Pattern         ::= PAT_DECLARE WS PAT_NAME Ows OptPatOption ';' Ows OptTagStr Ows
    OptPatOption    ::= Option
    OptPatOption    ::=
    OptTagStr       ::= TagStr
    OptTagStr       ::=
    TagStr          ::= '#' Ows TagList Ows '#'
    TagList         ::= TAG* separator => COMMA
    ReferencePlist  ::= 'PList' WS RefPlName Ows ';' Ows
    RefPlName       ::= OptRefFile PL_NAME
    OptRefFile      ::= RefFile*
    RefFile         ::= FILE_NAME ':'
    Ows ::= WS # a lexeme cannot have zero length,
    Ows ::=    # so optional whitespace must be a high-level grammar feature

    WS                ~ ws
    COMMA             ~ ','
    COMMENT           ~ '#' comment_chars [\n] | '#' comment_chars [\n] ws
    FLOAT             ~ int | int '.' int
    BASE_NUMBER       ~ int
    PL_NAME           ~ identifier
    TAG               ~ identifier
    PAT_NAME          ~ identifier
    PAT_DECLARE       ~ 'Pat' | 'Pattern'
    GLOBAL_PL_DECLARE ~ 'GlobalPList' | 'LocalPList' | 'PatternList'
    FILE_NAME         ~ [\w.]+
    OPTION_DATA       ~ [\w \.,]*
    ws                ~ [\s]+
    identifier        ~ [\w]+
    comment_chars     ~ [^\n]+
    int               ~ [\d]+

    Notice also that a “*” quantifier repeats a rule, instead of making it optional. If you want to signal that a rule is optional, then add an empty production Rule ::=. Lexemes cannot have zero length, and must always consume characters.
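    For instance, an optional tag string is expressed in the rewritten grammar as a pair of productions, one of them empty, rather than with a quantifier:

```
OptTagStr ::= TagStr
OptTagStr ::=        # empty production: OptTagStr may match nothing
```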

    Since this grammar is tidied up and shifts as much work as possible into the more efficient low-level grammar, it should also use less memory. However, two problems remain:

    • There is some ambiguity between comments and base specifications or tag strings. This might lead to false parses, and is a source of inefficiency. You can reduce this by making “#” an illegal character inside comments.

    • You specify all whitespace manually. You could also make whitespace in the high-level rules implicit, via a :discard lexeme. This will increase efficiency as the high-level grammar has to deal with fewer symbols. While you can still require whitespace explicitly, whitespace would then be allowed between any rules.
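    A sketch of what that could look like (assuming whitespace carries no meaning anywhere in the format):

```
:discard ~ ws
ws       ~ [\s]+
```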

      Your changes were certainly effective. The 2MB file I was parsing previously took 33+GB of memory to parse. Your updated changes require ~1GB.

      I also tried the :discard lexeme to remove the Ows references. This worked on the first attempt, so I was clearly doing something wrong the first time around. This reduced the memory usage further to ~500MB and sped up the parse time. It now sits at about 15s vs. the old parser's 9s.

      Many thanks for the insight. Now I get to figure out how to get the data back from the parse :)

      I had little doubt my Marpa grammar had issues. I'm still trying to wrap my head around lexer vs. parser and how Marpa's two levels interact. I figured the main difference was captures vs. non capture in my reading. I didn't realize there was a performance impact. I'm still having a very difficult time understanding the differences between the G0/L0 rules.

      I actually tried to use the :discard lexeme, but I found that the required newline after a partial-line comment no longer matched; the newlines seemed to be discarded before the comment rule could see them. Maybe I'll have another look.
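      One possible way around that (an untested sketch) would be to discard only horizontal whitespace, so that newlines survive as explicit lexemes for the comment rule to consume:

```
:discard ~ hws
hws      ~ [ \t]+    # discard spaces and tabs only
NEWLINE  ~ [\n]      # newlines remain visible to the grammar
```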

      Thanks for the input, it's greatly appreciated :) I'll look over your changes and play with the code to see what I can learn.

Re: Grammar based parsing methodology for multi MB strings/files (Marpa::R2/Regexp::Grammars)
by Laurent_R (Canon) on May 06, 2014 at 21:26 UTC
    Hmm, a grammar will usually build a data structure that takes much more memory than the original text file, but it sounds a bit excessive to me that a 2-MB file would imply a 33 GB memory footprint; that's a really huge difference.

    My second point is that, from a quick look, your file format looks pretty regular, and I am not sure a full-fledged grammar is really required (understand me, I am really in favor of using real parsers as soon as the input gets a bit complicated in terms of token positions and the like, but here I am not totally convinced that it isn't overkill or over-engineering). From a quick glance at your data, I would probably use a couple of regexes to get rid of comments and then parse manually, using the semicolon as a separator.
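    A rough sketch of that hand-rolled approach (the sample input and the regexes are illustrative only, not a drop-in replacement for the grammar):

```perl
use strict;
use warnings;

# A small sample in the same shape as the posted files (illustrative)
my $s = <<'END';
Version 1.0;
GlobalPList plist1 [option1] {
    Pat n1000000g0000001; # a partial line comment
    Pat n2000000g0000002; #HOT#
}
END

# Capture the data-bearing #TAG# comments before stripping comments
my @tags = $s =~ /#\s*([\w,]+?)\s*#/g;

# Strip all remaining comments to end of line
$s =~ s/#[^\n]*//g;

# Split on ';', '{' and '}' (captured delimiters are kept in the
# result list) and track brace nesting depth per statement
my @stmts;
my $depth = 0;
for my $tok (grep { /\S/ } split /\s*([;{}])\s*/, $s) {
    if    ($tok eq '{') { $depth++ }
    elsif ($tok eq '}') { $depth-- }
    elsif ($tok ne ';') { push @stmts, [ $depth, $tok ] }
}
```

    After this, @tags holds ('HOT') and @stmts holds each statement paired with its nesting depth, e.g. [0, 'Version 1.0'] and [1, 'Pat n1000000g0000001'].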

    But that's only a personal opinion. I did not take the time to analyze your grammars in any detail, and I have no idea what you are really trying to do with this data, so I may be completely wrong. This was just a premature gut feeling after a very brief look at your problem; it could be that working an hour or two on it would lead me to a very different conclusion.

    Please also note that I know only a little bit about Regexp::Grammars, and next to nothing about Marpa.

      I understand your point about questioning whether a grammar is necessary. Up till now I have used a line-by-line method. However, I've dealt with enough bugs that I'd rather invest the time in a grammar that will do the job and that I believe will be easier to maintain in the future.

      I also foresee this syntax changing due to production changes in the not too distant future. I think a grammar will be easier to update than my current parser would have been.

      Thanks for looking it over for me in any case :)