
There has been plenty of good advice already, but I suppose I should offer mine anyway. ;-)

RecDescent is overkill for this project, unless you expect it to grow in complexity (i.e. not just in the number of tags you're handling, but greater structural complexity of the data).

A good indicator that a grammar is overkill is when it:

doesn't have many levels of rules
doesn't have many rules with two or more productions
doesn't construct a complex, multi-level data structure as it parses
does the vast majority of its work with rules that consist of a single regex

Moreover, when the data is line-based (i.e. each low-level rule in the grammar parses exactly one line), RecDescent is probably not needed.

Your grammar seems to meet most of those criteria.

On the other hand, the parsing task you have is very well suited for learning RecDescent.

If I were implementing a parser for this in real life, rather than as a teaching exercise, I would probably bundle the regexes for each line type into a hash, and then iterate over the lines, testing each one against the various alternatives. Like so:

my $name   = qr/(?:\w+)/;
my $data   = qr/(?:\w+)/;
my $num    = qr/(?:\d+)/;

# One regex per line type
my %line_is = (
        header          => qr/HDR($name) ($data)/,
        trailer         => qr/TLR($num)/,
        additive        => qr/(ADDRANGE|ADD|DELETERANGE|DELETE),/,
        additive_data   => qr/($num),($num?),($name)/,
);

# Anchor every pattern at the current match position
$_ = qr/\G(?:$_)/ foreach values %line_is;

my %data;

while (<DATA>) {
        if (/$line_is{header}/gcx) {
                $data{header} = { company => $1, code => $2 };
        }
        elsif (/$line_is{trailer}/gcx) {
                $data{trailer} = { count => $1 };
        }
        elsif (/$line_is{additive}/gcx) {
                my $cmd = $1;
                # \G resumes just past the command's comma
                unless (/$line_is{additive_data}/gc) {
                        warn "Bad $cmd: ", substr($_, pos($_) || 0);
                        next;   # don't record stale captures
                }
                # An empty second field is stored as undef
                push @{$data{record}}, [ $cmd, $1, (length($2) ? $2 : undef), $3 ];
        }
        else {
                # pos is undefined when nothing has matched on this line yet
                warn "Unparsable data: ", substr($_, pos($_) || 0);
        }
}

use Data::Dumper 'Dumper';
print Dumper [ \%data ];

__DATA__
HDRCOMPNAME BIG000OLD111IDENTIFIER1020301WITH1010LOTS1010OF1010CRAP
ADD,1234567890,,COMPNAME
ADDRANGE,2468,4680,COMPNAME
DELETE,987654321,,COMPNAME
DELETERANGE,13579,13599,COMPNAME
TLR000004

The result is quite readable and maintainable. And fast. Provided, of course, the data remains line-oriented.
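For anyone who hasn't met the \G and /gc combination that the loop above relies on, here is a minimal sketch of the idea in isolation. The tokenize helper and its token names (word, number, comma) are made up purely for illustration and are not part of the parser above; the point is only that \G anchors each attempt at pos, /g advances pos on success, and /c leaves pos alone on failure, so the alternatives can be tried in turn without copying or modifying the string.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of the code above): split one line into
# [type, text] tokens, returning the tokens and any unparsed remainder.
sub tokenize {
    my ($line) = @_;
    my @tokens;
    while (1) {
        # \G anchors each attempt at pos($line); /g advances pos on success,
        # and /c keeps pos where it was when an attempt fails.
        if    ($line =~ /\G([A-Z]+)/gc) { push @tokens, [ word   => $1 ] }
        elsif ($line =~ /\G(\d+)/gc)    { push @tokens, [ number => $1 ] }
        elsif ($line =~ /\G,/gc)        { push @tokens, [ comma  => ',' ] }
        else                            { last }
    }
    # pos is undefined if nothing matched at all, so fall back to 0
    my $rest = substr($line, pos($line) || 0);
    return (\@tokens, $rest);
}

my ($tokens, $rest) = tokenize('ADDRANGE,2468,4680,COMPNAME');
print join(' ', map { "$_->[0]($_->[1])" } @$tokens), "\n";
print "Unparsed: $rest\n" if length $rest;
```

Whatever is left in $rest is the part no alternative could match, which is the same thing the substr($_,pos) calls report in the warnings above.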

Finally, I do have big plans to rewrite RecDescent to make it much faster (though probably still Pure Perl). The original module was only supposed to be a quick-hack proof-of-concept for self-modifying parsers. It predates the /gc flag; hence the clunky (and slow!) parsing-by-substitution-of-copies idiom.

But it somehow escaped the lab and has subsequently infested a huge number of organizations, which now rely on it.

There's probably a lesson in that. ;-)

In reply to Re: advice with Parse::RecDescent by TheDamian
in thread advice with Parse::RecDescent by demerphq
