demerphq has asked for the wisdom of the Perl Monks concerning the following question:

Over the weekend I started playing with Parse::RecDescent and decided to apply it to a real world problem. Now perhaps using a tool this powerful for what I want to do is gross overkill, but it seemed that using it resulted in less code that is more maintainable than hand rolling a parser. Seeing how much advice in the monastery tends to argue that maintainable code is superior to faster code this route seemed ideal. Unfortunately using Parse::RecDescent is _much_ slower than I would like. So my question is threefold

The data (a file) that I need to parse looks like this:

HDRCOMPNAME BIG000OLD111IDENTIFIER1020301WITH1010LOTS1010OF1010CRAP ADD,1234567890,,COMPNAME ADDRANGE,2468,4680,COMPNAME DELETE,987654321,,COMPNAME DELETERANGE,13579,13599,COMPNAME TLR000004
and the grammer I am using looks like this:
my $Grammar=<<'END_GRAMMAR'; startrule : file file : header record(s?) trailer_t { $return={ header=>$item[1], records=>$item[2], count=>$item[3] } } header : header_t data_t { $return={ company=>$item[1], code=>$item[2] } } record : valid_rec | <error> valid_rec : type_t ',' number_t ',' number_t(?) ',' name_t { $return=[ $item[1], $item[3], @{$item[5]} ? $item[5] : undef, $item[7] ] } header_t : /HDR\w+/ { $return=substr($item[1],3) } trailer_t : /TLR\d+/ { $return=substr($item[1],3) } data_t : /\w+/ type_t : /ADD(?:RANGE)?|DELETE(?:RANGE)/ number_t : /\d+/ name_t : /\w+/ END_GRAMMAR
If its not obvious I have used the postfix _t for tokens.

Any wisdom regarding this would be really appreciated. Especially if there is some way to modify the grammer to enhance speed. These files can contain thousands+ records and the speed hit is seriously making me think of hand rolling this (which I really really dont want to do).

Thanks in advance,

Yves / DeMerphq
--
This space for rent.

Replies are listed 'Best First'.
Re: advice with Parse::RecDescent
by merlyn (Sage) on Dec 10, 2001 at 20:13 UTC
    The current P::RD engine does the equivalent of:
    $text =~ s/^$MATCH//;
    to inch along the string. If $text is small, this is a reasonable way to do it. If $text is large, however, you get a performance hit as you constantly shift the text down in memory.

    I'm told that the next version will walk through the string by maintaining pos (using \G and m//g instead), and that this speeds the process up substantially for large strings. Until then, it helps if you can "predigest" the input stream into logical chunks, and hand each one to a separate invocation of a specific grammar step.

    In your particular case, if you could limit the invocation to separately invoking P::RD methods for header, then each separate record, then trailer, you'd be much speedier.

    You could keep it within one invocation of P::RD by pre-digesting your file into a @queue, then putting just the header into $text, calling the top-level rule, and having the rule for the header suck the next item with $text .= shift @queue, ditto with the rules for record. Of course, backing up would be expensive, but your grammar can commit forward nicely.

    -- Randal L. Schwartz, Perl hacker

      Well, as the data is line by line splitting it into chunks is easy. Im just not sure how to approach it in terms of P::RD code. UPDATE
      I failed to read the last para carefully enough, so probably most of this reply is useless... Sorry. Trying your idea now.
      END UPDATE

      Would I just call startrule on each line seperately? I suppose this means I have to change the grammer? Perhaps so that startrule looks like

      startrule : header | record | trailer
      Is this what you had in mind? The other possibility that occurs to me is to use three seperate parsers each for the different types, but somehow I suspect that is a dead end...

      And thank you very much!

      Yves / DeMerphq
      --
      This space for rent.

      Ok I gave your $text.= @queue idea a try, but it seems unwilling to match multiple records. Im not real sure what Im doing wrong so I have no idea how proceed. Here is a copy of my new grammer with a couple of tother minor changes included. Any ideas of what I've done wrong would be appreciated.
      use strict; use warnings; use Parse::RecDescent; use Data::Dumper; our $RD_TRACE=1; my $Grammar=<<'END_GRAMMAR'; {my $company=""} startrule : file file : header record(s?) trailer_t { $return={header=>$item[1],records=>$item[2],count= +>$item[3]}; 1; } header : header_t data_t { $return={company=>$item[1],code=>$item[2]}; $text.=shift @::text; print "Company Set to $company\n Text=$text"; 1; } record : valid_record | <error> valid_record: type_t ',' number_t ',' number_t(?) ',' "$company" { $return=[ $item[1],$item[3],@{$item[5]} ? $item[5] + : undef ]; $text.=shift @::text; print "Text=$text"; 1; } header_t : /HDR\w+/ {$return=substr($item[1],3); $company=$return +;1;} trailer_t : /TLR\d+/ {$return=substr($item[1],3)} data_t : /\w+/ type_t : /ADD(?:RANGE)?|DELETE(?:RANGE)?/ number_t : /\d+/ END_GRAMMAR my $parser = Parse::RecDescent->new($Grammar) or die "Bad grammar!\n"; our @text=<DATA>; if (defined( my $t=$parser->startrule(shift @text))) { print Dumper($t); } else { print "Bad text!\n"; } __DATA__ HDRCOMPNAME BIG000OLD111IDENTIFIER1020301WITH1010LOTS1010OF1010CRAP ADD,1234567890,,COMPNAME ADD,1234567891,,COMPNAME ADD,1234567892,,COMPNAME ADDRANGE,1468,1680,COMPNAME ADDRANGE,2468,2680,COMPNAME ADDRANGE,3468,3680,COMPNAME DELETE,987654321,,COMPNAME DELETE,987654322,,COMPNAME DELETE,987654323,,COMPNAME DELETERANGE,13579,13599,COMPNAME DELETERANGE,23579,23599,COMPNAME DELETERANGE,33579,33599,COMPNAME TLR000012
      Now weirdly it seems to perform as expected if the same data is organized as so:
      HDRCOMPNAME BIG000OLD111IDENTIFIER1020301WITH1010LOTS1010OF1010CRAP ADD,1234567890,,COMPNAME ADDRANGE,1468,1680,COMPNAME DELETE,987654321,,COMPNAME DELETERANGE,13579,13599,COMPNAME ADD,1234567891,,COMPNAME ADDRANGE,2468,2680,COMPNAME DELETE,987654322,,COMPNAME DELETERANGE,33579,33599,COMPNAME ADD,1234567892,,COMPNAME ADDRANGE,3468,3680,COMPNAME DELETE,987654323,,COMPNAME DELETERANGE,23579,23599,COMPNAME TLR000012
      Which really confuses me! What have I done wrong? Any clues?

      Yves / DeMerphq
      --
      This space for rent.

Re: advice with Parse::RecDescent
by TheDamian (Vicar) on Dec 11, 2001 at 01:56 UTC
    There has been plenty of good advice already, but I suppose I should offer mine anyway. ;-)

    RecDescent is overkill for this project, unless you expect it to grow in complexity (i.e. not just in the number of tags you're handling, but greater structural complexity of the data).

    A good indicator that a grammar is overkill is when it:

    • doesn't have many levels of rules
    • doesn't have many rules with two or more productions
    • doesn't construct a complex, multi-level data structure as it parses
    • does the vast majority of its work with rules that consist of a single regex

    Moreover, when the data is line-based (i.e. each low-level rule in the grammar parses exactly one line), RecDescent is probably not needed.

    Your grammar seems to meet most of those criteria.

    On the other hand, the parsing task you have is very well suited for learning RecDescent.

    If I were implementing a parser for this in real life, rather than as a teaching exercise, I would probably bundle the regexes for each line type into a hash, and then iterate lines, testing against the various alternatives. Like so:

    my $name = qr/(?:\w+)/; my $data = qr/(?:\w+)/; my $num = qr/(?:\d+)/; my %line_is = ( header => qr/HDR($name) ($data)/, trailer => qr/TLR($num)/, additive => qr/(ADDRANGE|ADD|DELETERANGE|DELETE),/, additive_data => qr/($num),($num?),($name)/, ); $_ = qr/\G(?:$_)/ foreach values %line_is; my %data; while (<DATA>) { if (/$line_is{header}/gcx) { $data{header} = { company => $1, code => $2 } } elsif (/$line_is{trailer}/gcx) { $data{trailer} = { count => $1 } } elsif (/$line_is{additive}/gcx) { my $cmd = $1; warn "Bad $cmd: ", substr($_,pos) unless /$line_is{additive_data}/; push @{$data{record}}, [ $cmd, $1, $2||undef, $3 ] } else { warn "Unparsable data: ", substr($_,pos); } } use Data::Dumper 'Dumper'; print Dumper [ \%data ]; __DATA__ HDRCOMPNAME BIG000OLD111IDENTIFIER1020301WITH1010LOTS1010OF1010CRAP ADD,1234567890,,COMPNAME ADDRANGE,2468,4680,COMPNAME DELETE,987654321,,COMPNAME DELETERANGE,13579,13599,COMPNAME TLR000004

    The result is quite readable and maintainable. And fast. Provided, of course, the data remains line-oriented.

    Finally, I do have big plans to rewrite RecDescent to make it much faster (though probably still Pure Perl). The original module was only supposed to be a quick-hack proof-of-concept for self-modifying parsers. It predates the /gc flag; hence the clunky (and slow!) parsing-by-substitution-of-copies idiom.

    But somehow escaped the lab and has subsequently infested a huge number of organizations, which now rely on it.

    There's probably a lesson in that. ;-)

Re: advice with Parse::RecDescent
by miyagawa (Chaplain) on Dec 10, 2001 at 18:20 UTC
    using Parse::RecDescent->Precompile will constribute to your performance improvement.

    --
    Tatsuhiko Miyagawa
    miyagawa@cpan.org

      Well, I tired this, but the speed gain was modest if apparent at all. Seems that it wont speed up the actual parsing until version 2.0. For which I eagerly await!

      Thanks though,

      Yves / DeMerphq
      --
      This space for rent.

Re: advice with Parse::RecDescent
by VSarkiss (Monsignor) on Dec 10, 2001 at 21:32 UTC

    You have some excellent advice above on how to use Parse::RecDescent. My advice to you, however, is not to use it at all.

    A recursive descent parser can deal with very complicated grammars, but the one you're dealing with is very simple. As a matter of fact, it looks like you could write a loop that reads a line, then dispatches based on its first three characters. Using a full-blown recursive descent parser in this situation is like swatting a fly with a sledgehammer. (Unless the grammar above is just an illustration and the real one is more complicated.)

    As for maintainability: if the code can really be as simple as I've outlined above, then I'd say the simplicity would make a greater contribution to maintainability than the use of a standard module would. In other words, I think reading through a simple read/dispatch loop with four clauses -- HDR, TLR, ADD, DEL -- is as simple (if not simpler) than digesting the grammar above. And simplicity is one of the keys to maintainability.

      For what it's worth, I have to agree with you, VSarkiss. Just looking at the input data immediately makes me think of code that looks something like this:

      # ... file is opened in handle FH my ($compname, $hdrdata, $tlrdata ); while (<FH>) { /^HDR(\w+)\s+(\w+)/ && do { $compname=$1; $hdrdata=$2; # handle a header next; }; /^TLR(\d+)/ && do { $tlrdata=$1; # handle a trailer with number=$1 next; }; /^ADD(?:RANGE)?|DELETE(?:RANGE)?/ && do { my ($type,@range_data) = split /,/; my $comp_rec = pop @range_data; if ($comp_name eq $comp_rec ) { # validate record # process type, range data, etc. } else { # error? } next; }; # any other kind of line falls through here # error: ... }

      Voila! Simplicity itself, no? Well, okay. Maybe not, but certainly straightforward.

      dmm

      
      You can give a man a fish and feed him for a day ...
      Or, you can teach him to fish and feed him for a lifetime
      
      Perhaps so. My key worry is that as time goes on the spec will change and most probably towards something more complicated.

      Nevertheless maybe not using this tool is the best course of action after all. I will let you know how I get on.

      Yves / DeMerphq
      --
      This space for rent.

Re: advice with Parse::RecDescent
by gbarr (Monk) on Dec 10, 2001 at 21:14 UTC
    The way in which P::RD is implemented gives it great flexability and power. But this comes at the cost of performance.

    If speed is of a real concern then I would suggest you look at byacc and the perl output mode. Although you will still need to write your own lexer.

    This is what I have done in the past, and can bee seen in the parser included in Convert::ASN1

      If you are thinking of moving to an LALR grammer also take a look at Parse::Yapp.

      Not sure how it compares to byacc with perl output (any ideas anyone?), but together with Parse::Lex made a nice combination for me.

      --

      Nic
Re: advice with Parse::RecDescent
by princepawn (Parson) on Dec 10, 2001 at 22:10 UTC