advice with Parse::RecDescent

demerphq has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
Re: advice with Parse::RecDescent by merlyn (Sage) on Dec 10, 2001 at 20:13 UTC
The current P::RD engine does the equivalent of: `$text =~ s/^$MATCH//;` [download] to inch along the string. If `$text` is small, this is a reasonable way to do it. If `$text` is large, however, you get a performance hit as you constantly shift the text down in memory. I'm told that the next version will walk through the string by maintaining `pos` (using `\G` and `m//g` instead), and that this speeds the process up substantially for large strings. Until then, it helps if you can "predigest" the input stream into logical chunks, and hand each one to a separate invocation of a specific grammar step. In your particular case, if you could limit the invocation to separately invoking P::RD methods for `header`, then each separate `record`, then `trailer`, you'd be much speedier. You could keep it within one invocation of P::RD by pre-digesting your file into a `@queue`, then putting just the header into `$text`, calling the top-level rule, and having the rule for the header suck the next item with `$text .= shift @queue`, ditto with the rules for `record`. Of course, backing up would be expensive, but your grammar can commit forward nicely. -- Randal L. Schwartz, Perl hacker	[reply] [d/l]
Re: Re: advice with Parse::RecDescent by demerphq (Chancellor) on Dec 10, 2001 at 20:24 UTC
Well, as the data is line by line splitting it into chunks is easy. Im just not sure how to approach it in terms of P::RD code. UPDATE I failed to read the last para carefully enough, so probably most of this reply is useless... Sorry. Trying your idea now. END UPDATE Would I just call startrule on each line seperately? I suppose this means I have to change the grammer? Perhaps so that startrule looks like `startrule : header \| record \| trailer` [download] Is this what you had in mind? The other possibility that occurs to me is to use three seperate parsers each for the different types, but somehow I suspect that is a dead end... And thank you very much! Yves / DeMerphq -- This space for rent.	[reply] [d/l]
Re: Re: advice with Parse::RecDescent by demerphq (Chancellor) on Dec 10, 2001 at 21:36 UTC
Ok I gave your $text.= @queue idea a try, but it seems unwilling to match multiple records. Im not real sure what Im doing wrong so I have no idea how proceed. Here is a copy of my new grammer with a couple of tother minor changes included. Any ideas of what I've done wrong would be appreciated. use strict; use warnings; use Parse::RecDescent; use Data::Dumper; our $RD_TRACE=1; my $Grammar=<<'END_GRAMMAR'; {my $company=""} startrule : file file : header record(s?) trailer_t { $return={header=>$item[1],records=>$item[2],count= +>$item[3]}; 1; } header : header_t data_t { $return={company=>$item[1],code=>$item[2]}; $text.=shift @::text; print "Company Set to $company\n Text=$text"; 1; } record : valid_record \| <error> valid_record: type_t ',' number_t ',' number_t(?) ',' "$company" { $return=[ $item[1],$item[3],@{$item[5]} ? $item[5] + : undef ]; $text.=shift @::text; print "Text=$text"; 1; } header_t : /HDR\w+/ {$return=substr($item[1],3); $company=$return +;1;} trailer_t : /TLR\d+/ {$return=substr($item[1],3)} data_t : /\w+/ type_t : /ADD(?:RANGE)?\|DELETE(?:RANGE)?/ number_t : /\d+/ END_GRAMMAR my $parser = Parse::RecDescent->new($Grammar) or die "Bad grammar!\n"; our @text=<DATA>; if (defined( my $t=$parser->startrule(shift @text))) { print Dumper($t); } else { print "Bad text!\n"; } __DATA__ HDRCOMPNAME BIG000OLD111IDENTIFIER1020301WITH1010LOTS1010OF1010CRAP ADD,1234567890,,COMPNAME ADD,1234567891,,COMPNAME ADD,1234567892,,COMPNAME ADDRANGE,1468,1680,COMPNAME ADDRANGE,2468,2680,COMPNAME ADDRANGE,3468,3680,COMPNAME DELETE,987654321,,COMPNAME DELETE,987654322,,COMPNAME DELETE,987654323,,COMPNAME DELETERANGE,13579,13599,COMPNAME DELETERANGE,23579,23599,COMPNAME DELETERANGE,33579,33599,COMPNAME TLR000012 [download] Now weirdly it seems to perform as expected if the same data is organized as so: `HDRCOMPNAME BIG000OLD111IDENTIFIER1020301WITH1010LOTS1010OF1010CRAP ADD,1234567890,,COMPNAME ADDRANGE,1468,1680,COMPNAME DELETE,987654321,,COMPNAME DELETERANGE,13579,13599,COMPNAME ADD,1234567891,,COMPNAME ADDRANGE,2468,2680,COMPNAME DELETE,987654322,,COMPNAME DELETERANGE,33579,33599,COMPNAME ADD,1234567892,,COMPNAME ADDRANGE,3468,3680,COMPNAME DELETE,987654323,,COMPNAME DELETERANGE,23579,23599,COMPNAME TLR000012` [download] Which really confuses me! What have I done wrong? Any clues? Yves / DeMerphq -- This space for rent.	[reply] [d/l] [select]
Re: advice with Parse::RecDescent by TheDamian (Vicar) on Dec 11, 2001 at 01:56 UTC
There has been plenty of good advice already, but I suppose I should offer mine anyway. `;-)` RecDescent is overkill for this project, unless you expect it to grow in complexity (i.e. not just in the number of tags you're handling, but greater structural complexity of the data). A good indicator that a grammar is overkill is when it: doesn't have many levels of rules doesn't have many rules with two or more productions doesn't construct a complex, multi-level data structure as it parses does the vast majority of its work with rules that consist of a single regex Moreover, when the data is line-based (i.e. each low-level rule in the grammar parses exactly one line), RecDescent is probably not needed. Your grammar seems to meet most of those criteria. On the other hand, the parsing task you have is very well suited for learning RecDescent. If I were implementing a parser for this in real life, rather than as a teaching exercise, I would probably bundle the regexes for each line type into a hash, and then iterate lines, testing against the various alternatives. Like so: my $name = qr/(?:\w+)/; my $data = qr/(?:\w+)/; my $num = qr/(?:\d+)/; my %line_is = ( header => qr/HDR($name) ($data)/, trailer => qr/TLR($num)/, additive => qr/(ADDRANGE\|ADD\|DELETERANGE\|DELETE),/, additive_data => qr/($num),($num?),($name)/, ); $_ = qr/\G(?:$_)/ foreach values %line_is; my %data; while (<DATA>) { if (/$line_is{header}/gcx) { $data{header} = { company => $1, code => $2 } } elsif (/$line_is{trailer}/gcx) { $data{trailer} = { count => $1 } } elsif (/$line_is{additive}/gcx) { my $cmd = $1; warn "Bad $cmd: ", substr($_,pos) unless /$line_is{additive_data}/; push @{$data{record}}, [ $cmd, $1, $2\|\|undef, $3 ] } else { warn "Unparsable data: ", substr($_,pos); } } use Data::Dumper 'Dumper'; print Dumper [ \%data ]; __DATA__ HDRCOMPNAME BIG000OLD111IDENTIFIER1020301WITH1010LOTS1010OF1010CRAP ADD,1234567890,,COMPNAME ADDRANGE,2468,4680,COMPNAME DELETE,987654321,,COMPNAME DELETERANGE,13579,13599,COMPNAME TLR000004 [download] The result is quite readable and maintainable. And fast. Provided, of course, the data remains line-oriented. Finally, I do have big plans to rewrite RecDescent to make it much faster (though probably still Pure Perl). The original module was only supposed to be a quick-hack proof-of-concept for self-modifying parsers. It predates the `/gc` flag; hence the clunky (and slow!) parsing-by-substitution-of-copies idiom. But somehow escaped the lab and has subsequently infested a huge number of organizations, which now rely on it. There's probably a lesson in that. `;-)`	[reply] [d/l]
Re: advice with Parse::RecDescent by miyagawa (Chaplain) on Dec 10, 2001 at 18:20 UTC
using `Parse::RecDescent->Precompile` will constribute to your performance improvement. -- Tatsuhiko Miyagawa miyagawa@cpan.org	[reply] [d/l]
Re: Re: advice with Parse::RecDescent by demerphq (Chancellor) on Dec 10, 2001 at 18:56 UTC
Well, I tired this, but the speed gain was modest if apparent at all. Seems that it wont speed up the actual parsing until version 2.0. For which I eagerly await! Thanks though, Yves / DeMerphq -- This space for rent.	[reply]
Re: advice with Parse::RecDescent by VSarkiss (Monsignor) on Dec 10, 2001 at 21:32 UTC
You have some excellent advice above on how to use Parse::RecDescent. My advice to you, however, is not to use it at all. A recursive descent parser can deal with very complicated grammars, but the one you're dealing with is very simple. As a matter of fact, it looks like you could write a loop that reads a line, then dispatches based on its first three characters. Using a full-blown recursive descent parser in this situation is like swatting a fly with a sledgehammer. (Unless the grammar above is just an illustration and the real one is more complicated.) As for maintainability: if the code can really be as simple as I've outlined above, then I'd say the simplicity would make a greater contribution to maintainability than the use of a standard module would. In other words, I think reading through a simple read/dispatch loop with four clauses -- HDR, TLR, ADD, DEL -- is as simple (if not simpler) than digesting the grammar above. And simplicity is one of the keys to maintainability.	[reply]
Re(2): advice with Parse::RecDescent by dmmiller2k (Chaplain) on Dec 10, 2001 at 23:47 UTC
For what it's worth, I have to agree with you, VSarkiss. Just looking at the input data immediately makes me think of code that looks something like this: # ... file is opened in handle FH my ($compname, $hdrdata, $tlrdata ); while (<FH>) { /^HDR(\w+)\s+(\w+)/ && do { $compname=$1; $hdrdata=$2; # handle a header next; }; /^TLR(\d+)/ && do { $tlrdata=$1; # handle a trailer with number=$1 next; }; /^ADD(?:RANGE)?\|DELETE(?:RANGE)?/ && do { my ($type,@range_data) = split /,/; my $comp_rec = pop @range_data; if ($comp_name eq $comp_rec ) { # validate record # process type, range data, etc. } else { # error? } next; }; # any other kind of line falls through here # error: ... } [download] Voila! Simplicity itself, no? Well, okay. Maybe not, but certainly straightforward. dmm You can give a man a fish and feed him for a day ... Or, you can teach him to fish and feed him for a lifetime	[reply] [d/l]
Re: Re: advice with Parse::RecDescent by demerphq (Chancellor) on Dec 10, 2001 at 21:51 UTC
Perhaps so. My key worry is that as time goes on the spec will change and most probably towards something more complicated. Nevertheless maybe not using this tool is the best course of action after all. I will let you know how I get on. Yves / DeMerphq -- This space for rent.	[reply]
Re: advice with Parse::RecDescent by gbarr (Monk) on Dec 10, 2001 at 21:14 UTC
The way in which P::RD is implemented gives it great flexability and power. But this comes at the cost of performance. If speed is of a real concern then I would suggest you look at byacc and the perl output mode. Although you will still need to write your own lexer. This is what I have done in the past, and can bee seen in the parser included in Convert::ASN1	[reply]
Re: Re: advice with Parse::RecDescent by NicS (Scribe) on Dec 10, 2001 at 22:04 UTC
If you are thinking of moving to an LALR grammer also take a look at Parse::Yapp. Not sure how it compares to byacc with perl output (any ideas anyone?), but together with Parse::Lex made a nice combination for me. -- Nic	[reply]
Re: advice with Parse::RecDescent by princepawn (Parson) on Dec 10, 2001 at 22:10 UTC
there are several suggestions for speeding up P::RD grammars in the Parse::RecDescent::FAQ	[reply]