I have a Perl parser for a technical language (the details of which are unimportant here). The parser is handed a text file of around 400KB or larger, which is read in as a single scalar (this takes a fraction of a second). It puts the string into $_ and then uses a series of constructs like those below:
    if (m/\G menu \s* \( \s* $RXstr \s* \) \s* \{/oxgc) {
        my $name = $1;
        parse_menu($name);
    }
    elsif (m/\G driver \s* \( \s* $RXstr \s* \)/oxgc) {
        my $name = $1;
        parse_driver($name);
    }
The $RXstr used above is defined as:
    our $RXname  = qr/ [a-zA-Z0-9_\-:.\[\]<>;]+ /x;
    our $RXhex   = qr/ (?: 0 [xX] [0-9A-Fa-f]+ ) /x;
    our $RXoct   = qr/ 0 [0-7]* /x;
    our $RXuint  = qr/ [0-9]+ /x;
    our $RXint   = qr/ -? $RXuint /ox;
    our $RXuintx = qr/ ( $RXhex | $RXoct | $RXuint ) /ox;
    our $RXintx  = qr/ ( $RXhex | $RXoct | $RXint ) /ox;
    our $RXnum   = qr/ -? (?: [0-9]+ | [0-9]* \. [0-9]+ ) (?: [eE] [-+]? [ +0-9]+ )? /x;
    our $RXdqs   = qr/ " (?: [^"] | \\" )* " /x;
    our $RXstr   = qr/ ( $RXname | $RXnum | $RXdqs ) /ox;
The individual parse_menu() and parse_driver() routines called in the first code segment above continue parsing from where the previous match succeeded using similar constructs.
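For readers unfamiliar with this style, here is a minimal, self-contained sketch of it. The sample input, the simplified $RXstr, the "choice" keyword and the bodies of parse_menu() and parse_driver() are illustrative assumptions, not the real parser. Each routine matches at pos($_) using \G with the /gc flags, so a failed match leaves the position untouched for the next alternative to try:

```perl
use strict;
use warnings;

# Simplified stand-in for the real $RXstr: a bare name or a double-quoted string.
my $RXstr = qr/ ( [a-zA-Z0-9_]+ | " [^"]* " ) /x;

my @events;    # records what was parsed, for illustration only

sub parse_menu {
    my ($name) = @_;
    # Continues from where the caller's match ended: $_ and pos($_) are shared.
    while (1) {
        if (m/\G \s* choice \s* \( \s* $RXstr \s* \) /xgc) {
            push @events, "menu $name: choice $1";
        }
        elsif (m/\G \s* \} /xgc) {
            return;
        }
        else {
            die "Syntax error in menu $name at offset " . (pos($_) // 0) . "\n";
        }
    }
}

sub parse_driver {
    my ($name) = @_;
    push @events, "driver $name";
}

sub parse_text {
    local $_ = shift;
    pos($_) = 0;
    while (1) {
        if (m/\G \s* menu \s* \( \s* $RXstr \s* \) \s* \{ /xgc) {
            my $name = $1;
            parse_menu($name);
        }
        elsif (m/\G \s* driver \s* \( \s* $RXstr \s* \) /xgc) {
            my $name = $1;
            parse_driver($name);
        }
        elsif (m/\G \s* \z /xgc) {
            last;    # only whitespace left: end of input
        }
        else {
            die "Syntax error at offset " . (pos($_) // 0) . "\n";
        }
    }
}

my $text = <<'END';
menu(menuYesNo) {
    choice(NO)
    choice("YES")
}
driver(drvAsyn)
END

parse_text($text);
print "$_\n" for @events;
```

The point of the sketch is only that every token costs one m/\G.../gc match against the same large scalar, which is why the number of match opcodes executed grows with the size of the input.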
This works fine and performs well on Perl versions before 5.20. Here are some results from running this program under 3 different versions of Perl, measured on macOS; the regression has also been reported on Debian and Ubuntu:
    woz$ perlbrew use 5.18.0
    woz$ time perl -CSD registerRecordDeviceDriver.pl softIoc.dbd
    real    0m0.461s
    user    0m0.380s
    sys     0m0.020s

    woz$ perlbrew use 5.20.0
    woz$ time perl -CSD registerRecordDeviceDriver.pl softIoc.dbd
    real    0m14.656s
    user    0m13.548s
    sys     0m0.075s

    woz$ perlbrew use 5.24.1
    woz$ time perl -CSD registerRecordDeviceDriver.pl softIoc.dbd
    real    0m9.518s
    user    0m8.977s
    sys     0m0.044s
Using NYTProf I have profiled the code, and the additional time in the later Perl versions is all attributable to Parser::CORE:match (the regex match opcode). NYTProf counts 99062 calls to that opcode for this particular 406KB input file, spread across 9 separate routines in the parser.
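To check whether the slowdown can be reproduced outside the full parser, a small timing script along these lines might help. It is only a sketch: the synthetic input, the simplified pattern and the helper name time_matches() are assumptions made for illustration, and it relies only on the core Time::HiRes module. Because the program is run with -CSD, the sketch also times a copy of the string that has been upgraded with utf8::upgrade(); whether the internal string encoding has anything to do with this regression is something to test, not a conclusion.

```perl
use strict;
use warnings;
use Time::HiRes ();

# Synthetic input of roughly 400KB (assumption: the real file looks different).
my $line = qq{driver(drvExample)\n};
my $text = $line x int(400_000 / length($line));

# Run \G matches across the whole string, count them and time the loop.
sub time_matches {
    my ($label, $str) = @_;
    my $count = 0;
    pos($str) = 0;
    my $start = Time::HiRes::time();
    $count++
        while $str =~ m/\G \s* driver \s* \( \s* ([a-zA-Z0-9_]+) \s* \) /xgc;
    my $elapsed = Time::HiRes::time() - $start;
    printf "%-10s %6d matches in %.3fs (perl %vd)\n",
        $label, $count, $elapsed, $^V;
    return $count;
}

# Same characters, two internal representations.
my $n_bytes = time_matches('bytes', $text);

my $upgraded = $text;
utf8::upgrade($upgraded);
my $n_utf8 = time_matches('upgraded', $upgraded);
```

Running the same script under each perlbrew installation would show whether the per-match cost depends on the Perl version, on the string representation, or on neither, in which case the cause lies somewhere this sketch does not exercise.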
This is obviously a bad regression.
Can anyone advise me how to modify my parser code so it performs well on all versions of Perl? There are other programmers on this project who would love to replace the Perl code with Python, which I really don't think we should need to do, but this level of a performance regression is a problem.
Thanks for any advice...
In reply to Parser Performance Question by songmaster