in reply to To parse or not to parse

Why?

You have a program that has been working fine for 8 months. It breaks the problem up into small pieces, and it uses a simple solution for each piece. From the description you give, about the only thing against it is that it reads a file twice, and that some streamlining could have been programmed in. But overall, you have a pretty good program, and it works.

But now you have an urge to rewrite it, and, from what I gather, in a more complex way. I'd say: don't do it.

"I think I am at that point, but what criteria do I use to determine whether I should stop using plain regexes and consider using Parse::RecDescent?"
Never. If you can do it with plain regexes, do it with plain regexes. Regexes can backtrack, but Parse::RecDescent will backtrack much more, unless you put a lot of work into it. Parse::RecDescent is slow. You'd only use Parse::RecDescent if your problem can't be solved with plain regexes. And even then you might want to consider regexes with advanced features over Parse::RecDescent. I can't prove it (yet), but I think that anything that can be parsed with a Parse::RecDescent parser can be parsed with a regex (except perhaps cases where you modify the parser mid-parse).
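
To give one rough illustration (nothing to do with your program, just the sort of advanced feature I mean): the deferred (??{ }) construct described in perlre lets a pattern recurse, which covers the textbook reason for reaching for a grammar, nested structure:

    use strict;
    use warnings;

    # Balanced parentheses with a plain (if hairy) regex: the (??{ })
    # block re-enters $nested at match time, giving recursion.
    my $nested;
    $nested = qr{
        \(
        (?:
            (?> [^()]+ )        # ordinary characters, no backtracking
          |
            (??{ $nested })     # a nested, balanced group
        )*
        \)
    }x;

    for my $s ('(a(b)c)', '(a(b)c', '((x)(y))') {
        print "$s => ", $s =~ /^(??{ $nested })$/ ? "balanced" : "not balanced", "\n";
    }

Not something to put in production without a comment block, but it shows the engine can do the job.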

Abigail

Re: Re: To parse or not to parse
by set_uk (Pilgrim) on Nov 13, 2003 at 23:47 UTC
    That's the kind of steer I need. Perhaps you can help with the problem of eliminating the garbage in a more efficient manner.

    I currently use the following to identify and remove garbage from a file. I compile the regexes because I apply them over and over to different lines in the file. I have a vague recollection that lookbehinds/lookaheads like the one in (?<![A-Z])(AUD.*) carry a large overhead. The routine runs really slowly, hogging all the memory, and takes about 20 minutes on a file with 150,000 lines. No doubt this can be improved:

    sub ValidateInputFile{
        my($line, $llen, $LineNo, $ErrChar, $ErrCount, $mtch_junk, $tncount, @check, $SOF_fh);
        $tncount=0;
        $ErrCount=0;
        $SOF_fh=OpenFile($fqd_sof, "READ");
        $SOFV_fh=OpenFile($fqd_sof.'validated', "WRITE");
        $SOFR_fh=OpenFile($fqd_sof.'removed', "WRITE");
        @HeaderGarbage = ('^LD 20', '^PT0000 ', '^REQ\: PRT', '^TYPE\: TNB', '^TN $',
                          '^CDEN ', '^CUST ', '^DATE ', '^PAGE ', '^DES ');
        foreach $RegEx (@HeaderGarbage){
            $CHRegex = qr/($RegEx\n)/;
            push @HeaderRegExes, $CHRegex;
        }
        @ErrCodes = ('(^\*.*\n)', '(ADMIN VSID.*)', '(BUG.*\n)', '(?<![A-Z])(AUD.*)',
                     '(SCH.*\n)', '(MAINT.*\n)', '(OPRDATA.*\n)', '(TIM000.*\n)',
                     '(DTC103.*)', '(DTC001.*)', '(DSL.*TIME.*\n)', '(FIR110.*\n)',
                     '(FIR103.*\n)', '(^ERR.*\n)', '(DCH:.*)', '(^VAS008.*\n)',
                     '(^INDEX.*VTN.*\n)', '(^IND.*\n)', '(^ NET 56 P.*\n)',
                     '(00A1491A 00000C16 00000000 000014F0 00000000 00000000 00000000 128)',
                     '(0000 00000000 0 00000000 00000000 00000000 00000000 00000000 00000000)');
        foreach $RegEx (@ErrCodes){
            $CBRegex = qr/$RegEx/;
            push @BodyRegExes, $CBRegex;
        }
        while ($line = ($SOF_fh->getline)){
            $LineNo++;
            $llen=length($line);
            # Remove the DOS line endings
            $line=~s/\r\n/\n/;
            # Following code removes spurious switch codes that can't be suppressed
            # First let's strip the first 20 lines of the TNB file of garbage
            if ($LineNo < 30){
                #print $LineNo."\n";
                #print $line;
                foreach $ErrChar (@HeaderRegExes){
                    undef $mtch_junk;
                    #print "Checking ".$ErrChar."\n";
                    if ($line=~/$ErrChar/){
                        $ErrCount++;
                        #print " Found ".$ErrChar."\n";
                        $line=~s/$ErrChar//;
                        $mtch_junk = $1;
                        if ($LineNo != $LastInvLine){
                            # Only print the line the first time we go over it
                            # Because $1 isn't reset if a subsequent attempt to match fails
                            if ($mtch_junk){
                                $LastInvLine = $LineNo;
                                $SOFR_fh->write("$LineNo($ErrChar)->$mtch_junk\n");
                            }
                        }
                    }
                }
            }
            # Now attempt to remove spurious codes
            foreach $ErrChar (@BodyRegExes){
                undef $mtch_junk;
                # Need to ensure this doesn't match BAUD just AUD
                if ($line=~/$ErrChar/){
                    $ErrCount++;
                    $line=~s/$ErrChar//;
                    $mtch_junk = $1;
                    if ($LineNo != $LastInvLine){
                        # Only print the line the first time we go over it
                        # Because $1 isn't reset if a subsequent attempt to match fails
                        if ($mtch_junk){
                            $LastInvLine = $LineNo;
                            $SOFR_fh->write("$LineNo($ErrChar)->$mtch_junk\n");
                        }
                    }
                }
            }
            if ($line =~ /^(TN\s+\d+\s+\d+)/) {
                $tncount++;
                #$LOG_fh->write(msg("INFO","Found $1","FORMAT"),8);
                #print $1."\n";
            }
            $SOFV_fh->write($line);
        }
        $LOG_fh->write(msg("INFO","Removed $ErrCount occurrences of non TN data","FORMAT"),4);
        msg("INFO","Removed $ErrCount occurrences of non TN data","PRINT");
        $LOG_fh->write(msg("INFO","Found $tncount occurrences of TN at beginning of line","FORMAT"),4);
        msg("INFO","Found $tncount occurrences of TN at beginning of line","PRINT");
        $SwitchCount{$switch_id}{'REMOVED'} = $ErrCount;
        $SwitchCount{$switch_id}{'TN_COUNT'} = $tncount;
        $SOF_fh->close or throw TNBCriticalException("Cannot close switch output file");
        undef $SOF_fh;
        undef $SOFV_fh;
        undef $SOFR_fh;
        return $tncount;
    }

    Simon

      I agree with Abigail-II that the Parse::RecDescent module is not going to make your program run faster or better; it will only introduce yet more maintenance work.

      I don't think two passes over the data file are necessary. You can do it all in a single pass: just drop the garbage lines as you loop through the input file.
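
      Something along these lines should do it (untested sketch; $HeaderRegex and $BodyRegex are single precompiled alternations, built as in the snippets below):

      while (my $line = $SOF_fh->getline) {
          $line =~ s/\r\n/\n/;                        # strip the DOS line ending
          next if $line =~ $HeaderRegex;              # drop header garbage outright
          $line =~ s/$BodyRegex//;                    # strip any spurious switch code
          $tncount++ if $line =~ /^TN\s+\d+\s+\d+/;   # still count the TN lines
          $SOFV_fh->write($line);                     # everything left is valid data
      }

      (The logging of removed lines is left out here, but it fits in the same loop.)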

      Also, with the precompiled regular expressions, you did:
      foreach $RegEx (@HeaderGarbage){
          $CHRegex = qr/($RegEx\n)/;
          push @HeaderRegExes, $CHRegex;
      }
      ...
      foreach $ErrChar (@HeaderRegExes){
          ...
      Why not just compile everything in your header garbage list into a single regular expression?
      @HeaderGarbage = ('LD 20', 'PT0000 ', ... 'CUST ', 'DATE ', 'PAGE ', 'DES ');
      my $reg = join '|', @HeaderGarbage;
      my $HeaderRegex = qr/^(?:$reg)/;   # group the alternation so ^ anchors every pattern
      ...
      while (my $line = <FILE>) {
          $line =~ s/$HeaderRegex//;     # replace matching garbage
          ...
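
      The same trick works for the body garbage; one way that might look (untested sketch, and the (?<![A-Z]) alternative keeps its lookbehind, so BAUD is still left alone):

      my $err = join '|', @ErrCodes;     # each pattern already carries its own parens
      my $BodyRegex = qr/$err/;
      ...
      $line =~ s/$BodyRegex//;           # one alternation per line instead of ~20 separate matches

      That way the regex engine walks each line once rather than once per pattern, which is probably where most of the 20 minutes is going.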