Hello monks,
I'm attempting to parse a set of files ranging in size from a few kB to several MB (< 10MB). My previous solution parsed line by line, but I've run into a number of cases where newlines in unexpected places (or missing from expected places) break my parser even though the syntax is legal for the file being parsed. Here's a short example of what a typical file looks like. Note that some of the comments also contain necessary data, such as the 'base=111', '#KEEP#', and '#HOT,COLD#' comments. Also, please excuse the difficult-to-read grammars; this was just trial-and-error test code, tweaked until it worked.
Version 1.0;
GlobalPList plist1 [option1] [option2] {
    #base=111
    # this is a full line comment
    Pat n1000000g0000001; # this is a partial line comment
    Pat n2000000g0000002; #HOT#
    LocalPList plist2 {
        Pat n5000000g0000005;
        GlobalPList plist4 {
            Pat n8000000g0000008; #KEEP#
        }
    }
    Pat n3000000g0000003; #HOT,COLD#
    Pat n4000000g0000004;
    PList plist2;
    PList file1.plist:plist3;
}
GlobalPList plist2 [option3] {
    PList plist1;
    Pattern n6000000g0000006;
    Pattern n7000000g0000007;
}
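To make the newline problem concrete, here is a minimal sketch (not the parser in question; the helper names `line_based_pats` and `string_based_pats` and the two-statement sample are made up for illustration). It shows a per-line regex missing a `Pat` statement that is split across lines, while the same regex applied to the whole string still finds it:

```perl
use strict;
use warnings;

# Hypothetical helper: collect pattern names by matching one line at a time.
sub line_based_pats {
    my ($text) = @_;
    my @names;
    for my $line (split /\n/, $text) {
        push @names, $1 if $line =~ /\b(?:Pat|Pattern)\s+(\w+)\s*;/;
    }
    return @names;
}

# Hypothetical helper: same regex, but applied to the whole string,
# so \s+ and \s* are free to span newlines.
sub string_based_pats {
    my ($text) = @_;
    my @names;
    while ($text =~ /\b(?:Pat|Pattern)\s+(\w+)\s*;/g) {
        push @names, $1;
    }
    return @names;
}

# Assumed sample: the second statement is split over three lines.
my $sample = "Pat n1000000g0000001;\nPat\n  n2000000g0000002\n;\n";

print "line based:   @{[ line_based_pats($sample) ]}\n";
print "string based: @{[ string_based_pats($sample) ]}\n";
```

The whole-string version is still only a sketch: it knows nothing about comments or nesting, which is what pushed me toward a real grammar.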
I have written a grammar-based parser in both Regexp::Grammars and Marpa::R2, but both seem to require a huge amount of resources once the files exceed 750kB. I thought Marpa would perform better, but for a 2MB file it requires 33GB of memory to perform the parse. Here are the two grammars (and test scripts) in case you'd like to have a look:
Regexp::Grammars
use strict;
use warnings;

my $file = $ARGV[0];
my $s;
if (defined $file) {
    # parse data from file argument
    open my $fh, '<', $file or die "Failed to open file ($file):$!\n";
    # for processing speed, tossing out full line comments and empty lines now
    while (<$fh>) {
        next if /^\s*#.*$/ and !/^\s*#\s*base\s*=/;
        next if /^\s*$/;
        $s .= $_
    }
    # close here, while $fh is still in scope
    close $fh or die "Failed to close file ($file): $!\n";
}
else {
    # parse data from DATA
    while (<DATA>) {
        next if /^\s*#.*$/ and !/^\s*#\s*base\s*=/;
        next if /^\s*$/;
        $s .= $_
    }
}
#print STDERR "DATA :\n$s\n";

use Regexp::Grammars;
my $parser = qr/
    # <logfile: ->
    # <debug: on>
    <nocontext:>

    <plist_file>
    <MATCH=(?{ $MATCH{plist_file} })>

    <token: float>                    \d+(\.\d+)?
    <token: global_plist_declare>     GlobalPList|LocalPList
    <token: reference_plist_declare>  PList
    <token: plist_name>               \w+
    <token: plist_option>             \[[\w \.,]*\]
    <token: base_number>              \d+
    <token: comment>                  \#[^\n]*
    <token: pat_declare>              Pat|Pattern
    <token: pat_name>                 \w+
    <token: tag>                      \w+
    <token: plist_open>               \{
    <token: plist_close>              \}
    <token: file_name>                [\w\.]+

    <rule: plist_file>
        <version> <[global_plist]>+

    <rule: version>
        Version <float> \;
        <MATCH=(?{ $MATCH{float} })>

    <token: embedded_base>
        \#\s*base\s*=\s*<[base_number]>+ % ,\s*\n
        <MATCH=(?{ $MATCH{base_number} })>

    <token: tags>
        \#<[tag]>+ % ,\#
        <MATCH=(?{ $MATCH{tag} })>

    <rule: global_plist>
        <.global_plist_declare> <plist_name> <[plist_option]>* % \s* <.plist_open>
            <embedded_base>?
            (<[data_node]> | <.comment>)+
        <.plist_close>

    <rule: data_node>
        <pattern> | <global_plist> | <reference_plist>

    <rule: reference_plist>
        <.reference_plist_declare> ( <file_name> : <plist_name> | <plist_name> ) ;

    <rule: pattern>
        <.pat_declare> <pat_name> ; <tags>?
/xms;

$s =~ $parser;

use Data::Dumper;
# separate lexical handle for the output file
open my $out_fh, '>', 'C:\tmp\tmp' or die "FAILED!\n";
print $out_fh "RESULTS:".Dumper(\%/);
print STDERR "RESULTS:".Dumper(\%/);

__DATA__
Version 1.0;
# Plist SVN Url: $HeadURL: https://XXXXXX...
# Plist SVN Revision: $Id: gt.plist 2555 2014-02-06 16:16:26Z vsgatcha $
# RunDir: /XXXXXX...
GlobalPList plist1 [option1] [option2] {
    #base=111
    # this is a full line comment
    Pat n1000000g0000001; # this is a partial line comment
    Pat n2000000g0000002; #HOT#
    LocalPList plist2 {
        Pat n5000000g0000005;
        GlobalPList plist4 {
            Pat n8000000g0000008; #KEEP#
        }
    }
    Pat n3000000g0000003; #HOT,COLD#
    Pat n4000000g0000004;
    PList plist2;
    PList file1:plist3;
}
GlobalPList plist2 [option3] {
    PList plist1;
    Pattern n6000000g0000006;
    Pattern n7000000g0000007;
}
Marpa::R2
use strict;
use warnings;
use Data::Dumper;

my $file = $ARGV[0];
my $s;
if (defined $file) {
    # parse data from file if given
    open my $fh, '<', $file or die "Failed to open file ($file):$!\n";
    # for processing speed, tossing out full line comments and empty lines now
    while (<$fh>) {
        next if /^\s*#.*$/ and !/^\s*#\s*base\s*=/;
        next if /^\s*$/;
        $s .= $_
    }
    close $fh or die "Failed to close file ($file): $!\n";
}
else {
    # parse data from DATA
    while (<DATA>) {
        next if /^\s*#.*$/ and !/^\s*#\s*base\s*=/;
        next if /^\s*$/;
        $s .= $_
    }
}
#print STDERR "DATA :\n$s\n";

use Marpa::R2;
my $grammar_str = <<'END_GRAMMAR';
inaccessible is fatal by default
lexeme default = latm => 1
:start ::= plist_file

plist_file         ::= version_data ows global_plists
version_data       ::= 'Version' mws float ows ';'
global_plists      ::= global_plist+
global_plist       ::= global_pl_declare mws pl_name ows opt_pl_options ows '{' ows opt_embedded_base ows nodes '}'
pl_name            ::= [\w]+
opt_pl_options     ::= option*
option             ::= '[' ows option_data ows ']' ows
nodes              ::= node+
node               ::= pattern | comment | global_plist || reference_plist
pattern            ::= pat_declare mws pat_name ows opt_pat_option ';' ows opt_tag_str ows
opt_pat_option     ::= option*
comment            ::= '#' comment_chars newline ows
reference_plist    ::= 'PList' mws ref_pl_name ows ';' ows
ref_pl_name        ::= opt_ref_file pl_name
opt_ref_file       ::= ref_file*
ref_file           ::= file_name ':'
file_name          ::= [\w\.]+
opt_tag_str        ::= tag_str*
tag_str            ::= '#' ows tag_list ows '#'
tag_list           ::= tag* separator => comma
opt_embedded_base  ::= embedded_base*
embedded_base      ::= '#' ows 'base' ows '=' ows base_numbers
base_numbers       ::= base_number+ separator => comma
comma              ::= ','
pat_name           ::= [\w]+
pat_declare        ::= 'Pat'|'Pattern'
comment_chars      ::= [^\n]*
newline            ::= [\n]
int                ::= [\d]+
tag                ::= [\w]+
global_pl_declare  ::= 'GlobalPList'|'LocalPList'|'PatternList'
option_data        ::= [\w \.,]*
base_number        ::= [\d]+
float              ::= int opt_fractional
opt_fractional     ::= fractional*
fractional         ::= '.' int
# optional whitespace
ows                ::= [\s]*
# mandatory whitespace
mws                ::= [\s]+
END_GRAMMAR

my $grammar = Marpa::R2::Scanless::G->new({
    source => \$grammar_str,
});
my $parser = Marpa::R2::Scanless::R->new({
    grammar => $grammar,
#    trace_values => 2,
#    trace_terminals => 1,
});

eval { $parser->read( \$s ); };
if ($@) {
    print "PARSE ERROR:$@\n";
    die "EXITING\n";
#    die $parser->show_progress(0, -1);
}
else {
    print "SUCCESSFUL PARSE!\n";
    <STDIN>;
}
print STDERR Dumper( $parser->value );

__DATA__
Version 1.0;
# Plist SVN Url: $HeadURL: https://XXXXXX...
# Plist SVN Revision: $Id: gt.plist 2555 2014-02-06 16:16:26Z vsgatcha $
# RunDir: /XXXXXX...
GlobalPList plist1 [option1] [option2] {
    #base=111
    # this is a full line comment
    Pat n1000000g0000001; # this is a partial line comment
    Pat n2000000g0000002; #HOT#
    LocalPList plist2 {
        Pat n5000000g0000005;
        GlobalPList plist4 {
            Pat n8000000g0000008; #KEEP#
        }
    }
    Pat n3000000g0000003; #HOT,COLD#
    Pat n4000000g0000004;
    PList plist2;
    PList file1.plist:plist3;
}
GlobalPList plist2 [option3] {
    PList plist1;
    Pattern n6000000g0000006;
    Pattern n7000000g0000007;
}
So on to my actual question. Is this just too large a piece of data to parse with a grammar based parser? Are my grammars just horribly inefficient? Is there a better way to handle parsing this data? Note the Marpa grammar actually doesn't even capture any results yet.
Thanks for any insight ahead of time :) I sincerely appreciate the time spent looking over the problem.
Thomas
In reply to Grammar based parsing methodology for multi MB strings/files (Marpa::R2/Regexp::Grammars) by tj_thompson