in reply to advice with Parse::RecDescent
RecDescent is overkill for this project, unless you expect it to grow in complexity (i.e. not just in the number of tags you're handling, but in the structural complexity of the data).
A good indicator that a grammar is overkill is when it:
Moreover, when the data is line-based (i.e. each low-level rule in the grammar parses exactly one line), RecDescent is probably not needed.
Your grammar seems to meet most of those criteria.
On the other hand, the parsing task you have is very well suited for learning RecDescent.
If I were implementing a parser for this in real life, rather than as a teaching exercise, I would probably bundle the regexes for each line type into a hash, and then iterate over the lines, testing each against the various alternatives. Like so:
    my $name = qr/(?:\w+)/;
    my $data = qr/(?:\w+)/;
    my $num  = qr/(?:\d+)/;

    my %line_is = (
        header        => qr/HDR($name) ($data)/,
        trailer       => qr/TLR($num)/,
        additive      => qr/(ADDRANGE|ADD|DELETERANGE|DELETE),/,
        additive_data => qr/($num),($num?),($name)/,
    );

    $_ = qr/\G(?:$_)/ foreach values %line_is;

    my %data;

    while (<DATA>) {
        if (/$line_is{header}/gcx) {
            $data{header} = { company => $1, code => $2 }
        }
        elsif (/$line_is{trailer}/gcx) {
            $data{trailer} = { count => $1 }
        }
        elsif (/$line_is{additive}/gcx) {
            my $cmd = $1;
            warn "Bad $cmd: ", substr($_,pos)
                unless /$line_is{additive_data}/;
            push @{$data{record}}, [ $cmd, $1, $2||undef, $3 ]
        }
        else {
            warn "Unparsable data: ", substr($_,pos);
        }
    }

    use Data::Dumper 'Dumper';
    print Dumper [ \%data ];

    __DATA__
    HDRCOMPNAME BIG000OLD111IDENTIFIER1020301WITH1010LOTS1010OF1010CRAP
    ADD,1234567890,,COMPNAME
    ADDRANGE,2468,4680,COMPNAME
    DELETE,987654321,,COMPNAME
    DELETERANGE,13579,13599,COMPNAME
    TLR000004
The result is quite readable and maintainable. And fast. Provided, of course, the data remains line-oriented.
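The loop above leans on three regex features working together: `\G` anchors a match at the current `pos()`, `/g` in scalar context advances `pos()` on success, and `/c` keeps `pos()` where it was on failure, so the next alternative is tried from the same spot. Here is a minimal sketch of that interaction in isolation. The helper name `scan_fields`, its token labels, and the sample line are made up for illustration; they are not part of the parser above.

```perl
use strict;
use warnings;

# Hypothetical helper (an assumption, not from the original post):
# split one line into [type, text] tokens by trying each alternative
# at the current pos() of the string.
sub scan_fields {
    my ($line) = @_;
    my @tokens;
    pos($line) = 0;                       # start scanning at the beginning
    while (pos($line) < length $line) {
        # \G pins each attempt to pos(); /c preserves pos() on failure,
        # so the next elsif starts from the same position.
        if    ($line =~ /\G(\d+)/gc)    { push @tokens, [ num   => $1  ] }
        elsif ($line =~ /\G([A-Z]+)/gc) { push @tokens, [ word  => $1  ] }
        elsif ($line =~ /\G,/gc)        { push @tokens, [ comma => ',' ] }
        else {
            # Nothing matched: report the unparsed remainder and stop
            push @tokens, [ junk => substr($line, pos($line)) ];
            last;
        }
    }
    return @tokens;
}

# Made-up sample line in the style of the data above
print join(' ', map { "$_->[0]=$_->[1]" }
           scan_fields('ADDRANGE,2468,4680,COMPNAME')), "\n";
# prints: word=ADDRANGE comma=, num=2468 comma=, num=4680 comma=, word=COMPNAME
```

Dropping `/c` from these matches would reset `pos()` on the first failed alternative, so the following alternatives would be tried from the start of the string rather than from where scanning left off.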
Finally, I do have big plans to rewrite RecDescent to make it much faster (though probably still Pure Perl). The original module was only supposed to be a quick-hack proof-of-concept for self-modifying parsers. It predates the /gc flag; hence the clunky (and slow!) parsing-by-substitution-of-copies idiom.
But it somehow escaped the lab and has subsequently infested a huge number of organizations, which now rely on it.
There's probably a lesson in that. ;-)