Re: advice with Parse::RecDescent

There has been plenty of good advice already, but I suppose I should offer mine anyway. ;-)

RecDescent is overkill for this project, unless you expect it to grow in complexity (i.e. not just in the number of tags you're handling, but greater structural complexity of the data).

A good indicator that a grammar is overkill is when it:

doesn't have many levels of rules
doesn't have many rules with two or more productions
doesn't construct a complex, multi-level data structure as it parses
does the vast majority of its work with rules that consist of a single regex

Moreover, when the data is line-based (i.e. each low-level rule in the grammar parses exactly one line), RecDescent is probably not needed.

Your grammar seems to meet most of those criteria.

On the other hand, the parsing task you have is very well suited for learning RecDescent.

If I were implementing a parser for this in real life, rather than as a teaching exercise, I would probably bundle the regexes for each line type into a hash, and then iterate over the lines, testing each one against the various alternatives. Like so:

my $name   = qr/(?:\w+)/;
my $data   = qr/(?:\w+)/;
my $num    = qr/(?:\d+)/;

my %line_is = (
        header          => qr/HDR($name) ($data)/,
        trailer         => qr/TLR($num)/,
        additive        => qr/(ADDRANGE|ADD|DELETERANGE|DELETE),/,
        additive_data   => qr/($num),($num?),($name)/,
);

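# Anchor every pattern at the current match position (\G), so that each
# successive match on a line resumes where the previous one stopped.
$_ = qr/\G(?:$_)/ foreach values %line_is;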

my %data;

while (<DATA>) {
        if (/$line_is{header}/gcx) {
                $data{header} = { company => $1, code => $2 }
        }
        elsif (/$line_is{trailer}/gcx) {
                $data{trailer} = { count => $1 }
        }
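        elsif (/$line_is{additive}/gcx) {
                my $cmd = $1;
                # Only record the entry if the rest of the line parses;
                # otherwise $1..$3 would be left over from the command match.
                if (/$line_is{additive_data}/) {
                        push @{$data{record}}, [ $cmd, $1, $2||undef, $3 ]
                }
                else {
                        warn "Bad $cmd: ", substr($_,pos);
                }
        }
        else {
                # Nothing matched, so pos is still undefined: report the whole line.
                warn "Unparsable data: $_";
        }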
}

use Data::Dumper 'Dumper';
print Dumper [ \%data ];

__DATA__
HDRCOMPNAME BIG000OLD111IDENTIFIER1020301WITH1010LOTS1010OF1010CRAP
ADD,1234567890,,COMPNAME
ADDRANGE,2468,4680,COMPNAME
DELETE,987654321,,COMPNAME
DELETERANGE,13579,13599,COMPNAME
TLR000004

The result is quite readable and maintainable. And fast. Provided, of course, the data remains line-oriented.
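
For anyone unfamiliar with the mechanism that loop leans on, here is a minimal sketch of how \G and /gc cooperate. The input line and the patterns are made up for illustration (they only resemble the ones above): a failed /gc match leaves pos() where it was, a successful one advances it, and the next \G-anchored match resumes from there.

```perl
use strict;
use warnings;

# Hypothetical input line, in the same shape as the sample data.
my $line = "ADDRANGE,2468,4680,COMPNAME";

my @trace;
for ($line) {
    # A failed /gc match leaves pos() alone (still at the start of the line).
    push @trace, 'not a header' unless /\GHDR/gc;

    # A successful /gc match advances pos() past what it consumed.
    if (/\G(ADDRANGE|ADD),/gc) {
        push @trace, "command $1, pos now " . pos();
    }

    # The next \G-anchored match picks up from that position.
    if (/\G(\d+),(\d*),(\w+)/gc) {
        push @trace, "fields $1 / $2 / $3";
    }
}
print "$_\n" for @trace;
```

Without the /c flag, the failed header test would reset the position, which is harmless at the start of a line but would break any alternative tried after a partial match.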

Finally, I do have big plans to rewrite RecDescent to make it much faster (though probably still Pure Perl). The original module was only supposed to be a quick-hack proof-of-concept for self-modifying parsers. It predates the /gc flag; hence the clunky (and slow!) parsing-by-substitution-of-copies idiom.
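
To make the difference between the two idioms concrete, here is a rough sketch. It is emphatically not RecDescent's actual internals, just a toy tokenizer with made-up subroutine names, written once in the consume-by-substitution style and once in the scan-in-place style that /gc makes possible.

```perl
use strict;
use warnings;

my $input = "ADD,1234567890,,COMPNAME";   # hypothetical sample line

# Style 1: consume by substitution. Each token is chopped off the front
# of a working copy, so the string is rewritten on every match.
sub tokens_by_substitution {
    my ($text) = @_;                # copy of the caller's string
    my @tokens;
    while (length $text) {
        if    ($text =~ s/\A(\w+)//) { push @tokens, $1  }
        elsif ($text =~ s/\A,//)     { push @tokens, ',' }
        else  { die "Unparsable: $text" }
    }
    return @tokens;
}

# Style 2: scan in place with \G and /gc. The string is never modified;
# only its pos() moves forward.
sub tokens_by_scanning {
    my ($text) = @_;
    my @tokens;
    for ($text) {
        while (1) {
            if    (/\G(\w+)/gc) { push @tokens, $1  }
            elsif (/\G,/gc)     { push @tokens, ',' }
            elsif (/\G\z/gc)    { last }
            else  { die "Unparsable: ", substr($_, pos() || 0) }
        }
    }
    return @tokens;
}

my @by_subst = tokens_by_substitution($input);
my @by_scan  = tokens_by_scanning($input);
print "same tokens\n" if "@by_subst" eq "@by_scan";
```

Both versions produce the same token list; the second simply avoids rewriting the string for every token it consumes.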

But somehow it escaped the lab and has subsequently infested a huge number of organizations, which now rely on it.

There's probably a lesson in that. ;-)
