Hello monks,

I'm attempting to parse a set of files ranging in size from a few kB to several MB (< 10MB). My previous solution parsed line by line, but I've run into a number of cases where a newline in an unexpected place (or a missing newline in an expected place) breaks my parser even though the input is legal syntax for the file format. Here's a short example of what a typical file looks like. Note that some of the comments also carry necessary data, such as the 'base=111', '#KEEP#' and '#HOT,COLD#' comments. Also, please excuse the hard-to-read grammars further down; they are test code that grew by trial and error until it worked.

Version 1.0;

GlobalPList plist1 [option1] [option2] {
    #base=111
    # this is a full line comment
    Pat n1000000g0000001; # this is a partial line comment
    Pat n2000000g0000002; #HOT#

    LocalPList plist2 {
        Pat n5000000g0000005;
        GlobalPList plist4 { Pat n8000000g0000008; #KEEP# } }

    Pat n3000000g0000003; #HOT,COLD#
    Pat n4000000g0000004;

    PList plist2;

    PList file1.plist:plist3;
}

GlobalPList plist2 [option3] {
    PList plist1;
    Pattern n6000000g0000006;
    Pattern n7000000g0000007;
}
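
To make the newline problem concrete, here is a minimal sketch of a scanner that treats newlines as ordinary whitespace and still keeps the comments that carry data. It is not one of my two parsers: the sub name tokenize_plist and the token type names (BASE, TAGS, OPTION, PUNCT, WORD) are made up for illustration, and the token rules are only what I inferred from the sample above, so treat them as assumptions rather than the real file spec.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of either script in this post.
# Returns a list of [TYPE, value] pairs. Whitespace (newlines included)
# and plain comments are skipped; base and tag comments are kept.
sub tokenize_plist {
    my ($text) = @_;
    my @tokens;
    pos($text) = 0;
    while (pos($text) < length $text) {
        if ($text =~ /\G\s+/gc) {
            next;    # a newline is just whitespace here
        }
        elsif ($text =~ /\G\#[ \t]*base[ \t]*=[ \t]*(\d+(?:[ \t]*,[ \t]*\d+)*)/gc) {
            push @tokens, [ BASE => $1 ];      # e.g. "#base=111"
        }
        elsif ($text =~ /\G\#(\w+(?:,\w+)*)\#/gc) {
            push @tokens, [ TAGS => $1 ];      # e.g. "#HOT,COLD#"
        }
        elsif ($text =~ /\G\#[^\n]*/gc) {
            next;    # ordinary comment: runs to the end of the line
        }
        elsif ($text =~ /\G\[([\w .,]*)\]/gc) {
            push @tokens, [ OPTION => $1 ];    # e.g. "[option1]"
        }
        elsif ($text =~ /\G([{};:])/gc) {
            push @tokens, [ PUNCT => $1 ];
        }
        elsif ($text =~ /\G([\w.]+)/gc) {
            push @tokens, [ WORD => $1 ];      # keywords, names, "1.0", file names
        }
        else {
            die sprintf "Unexpected character '%s' at offset %d\n",
                substr($text, pos($text), 1), pos($text);
        }
    }
    return @tokens;
}

# The same declarations with a line break in an awkward place
my $sample = "GlobalPList p1 [opt1] {\n#base=111\nPat\n  n1; #HOT,COLD# } ";
print join(' ', map { "$_->[0]($_->[1])" } tokenize_plist($sample)), "\n";
```

The point of the sketch is only that the /gc flags keep the match position on failure, so each \G alternative is tried at the same offset and line breaks never matter. It does not build the nested plist structure.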

I have written a grammar-based parser in both Regexp::Grammars and Marpa::R2, but both seem to require a huge amount of resources once the files exceed 750kB. I thought Marpa would perform better, but for the 2MB file it requires 33GB of memory to perform the parse. Here are the two grammars (and test scripts) in case you'd like to have a look:

Regexp::Grammars

use strict;
use warnings;

my $file = $ARGV[0];
my $s;

if (defined $file) {
    # parse data from file argument
    open my $fh, '<', $file
        or die "Failed to open file ($file):$!\n";

    # for processing speed, tossing out full line comments and empty lines now
    while (<$fh>) {
        next if /^\s*#.*$/ and !/^\s*#\s*base\s*=/;
        next if /^\s*$/;
        $s .= $_
    }

    # close here: $fh is lexical to this block and is never opened when reading DATA
    close $fh or die "Failed to close file ($file): $!\n";
}
else {
    # parse data from DATA
    while (<DATA>) {
        next if /^\s*#.*$/ and !/^\s*#\s*base\s*=/;
        next if /^\s*$/;
        $s .= $_
    }
}

#print STDERR "DATA :\n$s\n";


use Regexp::Grammars;

my $parser = qr/
#    <logfile: ->
#    <debug: on>
    <nocontext:>
    <plist_file>
    <MATCH=(?{ $MATCH{plist_file} })>

    <token: float>           \d+(\.\d+)?
    <token: global_plist_declare>    GlobalPList|LocalPList
    <token: reference_plist_declare> PList
    <token: plist_name>      \w+
    <token: plist_option>    \[[\w \.,]*\]
    <token: base_number>     \d+
    <token: comment>         \#[^\n]*
    <token: pat_declare>     Pat|Pattern
    <token: pat_name>        \w+
    <token: tag>             \w+
    <token: plist_open>      \{
    <token: plist_close>     \}
    <token: file_name>       [\w\.]+

    <rule: plist_file>
        <version>
        <[global_plist]>+

    <rule: version>
        Version <float> \;
        <MATCH=(?{ $MATCH{float} })>

    <token: embedded_base>
        \#\s*base\s*=\s*<[base_number]>+ % ,\s*\n
        <MATCH=(?{ $MATCH{base_number} })>

    <token: tags>
        \#<[tag]>+ % ,\#
        <MATCH=(?{ $MATCH{tag} })>

    <rule: global_plist>
        <.global_plist_declare>
        <plist_name>
        <[plist_option]>* % \s*
        <.plist_open>
        <embedded_base>?
        (<[data_node]> | <.comment>)+
        <.plist_close>

    <rule: data_node>
        <pattern> | <global_plist> | <reference_plist>

    <rule: reference_plist>
        <.reference_plist_declare>
        (  <file_name> : <plist_name>
           | <plist_name>
        )
        ;

    <rule: pattern>
        <.pat_declare> <pat_name> ; <tags>?

/xms;

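$s =~ $parser
    or die "Input did not match the grammar\n";

use Data::Dumper;
# the input handle is out of scope here, so declare a separate output handle
open my $out_fh, '>', 'C:\tmp\tmp' or die "FAILED!: $!\n";
print {$out_fh} "RESULTS:".Dumper(\%/);
print STDERR "RESULTS:".Dumper(\%/);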

__DATA__
Version 1.0;

# Plist SVN Url: $HeadURL: https://XXXXXX...
# Plist SVN Revision: $Id: gt.plist 2555 2014-02-06 16:16:26Z vsgatcha $

# RunDir: /XXXXXX...

GlobalPList plist1 [option1] [option2] {
    #base=111
    # this is a full line comment
    Pat n1000000g0000001; # this is a partial line comment
    Pat n2000000g0000002; #HOT#

    LocalPList plist2 {
        Pat n5000000g0000005;
        GlobalPList plist4 { Pat n8000000g0000008; #KEEP# } }

    Pat n3000000g0000003; #HOT,COLD#
    Pat n4000000g0000004;

    PList plist2;

    PList file1:plist3;
}

GlobalPList plist2 [option3] {
    PList plist1;
    Pattern n6000000g0000006;
    Pattern n7000000g0000007;
}

Marpa::R2

use strict;
use warnings;

use Data::Dumper;

my $file = $ARGV[0];
my $s;

if (defined $file) {
    # parse data from file if given
    open my $fh, '<', $file
        or die "Failed to open file ($file):$!\n";

    # for processing speed, tossing out full line comments and empty lines now
    while (<$fh>) {
        next if /^\s*#.*$/ and !/^\s*#\s*base\s*=/;
        next if /^\s*$/;
        $s .= $_
    }

    close $fh or die "Failed to close file ($file): $!\n";
}
else {
    # parse data from DATA
    while (<DATA>) {
        next if /^\s*#.*$/ and !/^\s*#\s*base\s*=/;
        next if /^\s*$/;
        $s .= $_
    }
}

#print STDERR "DATA :\n$s\n";

use Marpa::R2;

my $grammar_str = <<'END_GRAMMAR';
inaccessible is fatal by default
lexeme default = latm => 1

:start            ::= plist_file
plist_file        ::= version_data ows global_plists

version_data      ::= 'Version' mws float ows ';'

global_plists     ::= global_plist+
global_plist      ::= global_pl_declare mws pl_name ows opt_pl_options ows '{' ows opt_embedded_base ows nodes '}'

pl_name           ::= [\w]+
opt_pl_options    ::= option*
option            ::= '[' ows option_data ows ']' ows

nodes             ::= node+
node              ::= pattern | comment | global_plist || reference_plist

pattern           ::= pat_declare mws pat_name ows opt_pat_option ';' ows opt_tag_str ows
opt_pat_option    ::= option*
comment           ::= '#' comment_chars newline ows
reference_plist   ::= 'PList' mws ref_pl_name ows ';' ows

ref_pl_name       ::= opt_ref_file pl_name
opt_ref_file      ::= ref_file*
ref_file          ::=  file_name ':'
file_name         ::= [\w\.]+

opt_tag_str       ::= tag_str*
tag_str           ::= '#' ows tag_list ows '#'
tag_list          ::= tag*                   separator => comma
opt_embedded_base ::= embedded_base*
embedded_base     ::= '#' ows 'base' ows '=' ows base_numbers
base_numbers      ::= base_number+           separator => comma

comma             ::= ','
pat_name          ::= [\w]+
pat_declare       ::= 'Pat'|'Pattern'
comment_chars     ::= [^\n]*
newline           ::= [\n]
int               ::= [\d]+
tag               ::= [\w]+
global_pl_declare ::= 'GlobalPList'|'LocalPList'|'PatternList'
option_data       ::= [\w \.,]*
base_number       ::= [\d]+
float             ::= int opt_fractional
opt_fractional    ::= fractional*
fractional        ::= '.' int

# optional whitespace
ows               ::= [\s]*
# mandatory whitespace
mws               ::= [\s]+
END_GRAMMAR

my $grammar = Marpa::R2::Scanless::G->new({
    source => \$grammar_str,
});

my $parser = Marpa::R2::Scanless::R->new({
    grammar           => $grammar,
#    trace_values      => 2,
#    trace_terminals   => 1,
});

eval {
    $parser->read( \$s );
};
if ($@) {
    print "PARSE ERROR:$@\n";
    die "EXITING\n";
#    die $parser->show_progress(0, -1);
}
else {
    print "SUCCESSFUL PARSE!\n"; <STDIN>;
}

print STDERR Dumper( $parser->value );


__DATA__
Version 1.0;

# Plist SVN Url: $HeadURL: https://XXXXXX...
# Plist SVN Revision: $Id: gt.plist 2555 2014-02-06 16:16:26Z vsgatcha $

# RunDir: /XXXXXX...

GlobalPList plist1 [option1] [option2] {
    #base=111
    # this is a full line comment
    Pat n1000000g0000001; # this is a partial line comment
    Pat n2000000g0000002; #HOT#

    LocalPList plist2 {
        Pat n5000000g0000005;
        GlobalPList plist4 { Pat n8000000g0000008; #KEEP# } }

    Pat n3000000g0000003; #HOT,COLD#
    Pat n4000000g0000004;

    PList plist2;

    PList file1.plist:plist3;
}

GlobalPList plist2 [option3] {
    PList plist1;
    Pattern n6000000g0000006;
    Pattern n7000000g0000007;
}

So, on to my actual questions. Is this just too large a piece of data to parse with a grammar-based parser? Are my grammars just horribly inefficient? Is there a better way to handle parsing this data? Note that the Marpa grammar doesn't even capture any results yet.
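
In case it helps anyone reproduce the scaling, here is a rough sketch for generating inputs of a chosen size without my real files. The sub name make_plist and the block layout (20 patterns per list, one option, one base comment) are hypothetical choices made for illustration; the real files are more varied, so the numbers it produces are only indicative.

```perl
use strict;
use warnings;

# Hypothetical helper: build a plist-like string of at least $target_bytes
# by appending numbered GlobalPList blocks modelled on the sample data.
sub make_plist {
    my ($target_bytes) = @_;
    my $out = "Version 1.0;\n\n";
    my $n   = 0;
    while (length($out) < $target_bytes) {
        $n++;
        $out .= sprintf "GlobalPList plist%d [option1] {\n", $n;
        $out .= "    #base=111\n";
        $out .= sprintf "    Pat n%07dg%07d; #HOT#\n", $n, $_ for 1 .. 20;
        $out .= "}\n\n";
    }
    return $out;
}

# Report the generated size for a few targets around the 750kB mark
for my $kb (100, 250, 500, 750) {
    my $text = make_plist($kb * 1024);
    printf "target %4d kB -> %d bytes\n", $kb, length $text;
}
```

Writing each string to a file and passing that file as the first argument to either script above should show where time and memory start to climb.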

Thanks in advance for any insight :) I sincerely appreciate the time spent looking over the problem.

Thomas

In reply to Grammar based parsing methodology for multi MB strings/files (Marpa::R2/Regexp::Grammars) by tj_thompson
