I have a Perl parser for a technical language (the details are unimportant here). The parser is handed a text file of 400KB or more, which is read into a single scalar in a fraction of a second. It puts the string into $_ and then uses a series of constructs like these:

    if (m/\G menu \s* \( \s* $RXstr \s* \) \s* \{/oxgc) {
        my $name = $1;
        parse_menu($name);
    }
    elsif (m/\G driver \s* \( \s* $RXstr \s* \)/oxgc) {
        my $name = $1;
        parse_driver($name);
    }

The $RXstr used above is defined as:

our $RXname =  qr/ [a-zA-Z0-9_\-:.\[\]<>;]+ /x;
our $RXhex =   qr/ (?: 0 [xX] [0-9A-Fa-f]+ ) /x;
our $RXoct =   qr/ 0 [0-7]* /x;
our $RXuint =  qr/ [0-9]+ /x;
our $RXint =   qr/ -? $RXuint /ox;
our $RXuintx = qr/ ( $RXhex | $RXoct | $RXuint ) /ox;
our $RXintx =  qr/ ( $RXhex | $RXoct | $RXint ) /ox;
our $RXnum =   qr/ -? (?: [0-9]+ | [0-9]* \. [0-9]+ ) (?: [eE] [-+]? [0-9]+ )? /x;
our $RXdqs =   qr/ " (?: [^"] | \\" )* " /x;
our $RXstr =   qr/ ( $RXname | $RXnum | $RXdqs ) /ox;

The individual parse_menu() and parse_driver() routines called in the first code segment above continue parsing from where the previous match succeeded using similar constructs.
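
For readers who want to see the whole pattern in one place, here is a minimal, self-contained sketch of the technique: a dispatcher and a sub-parser that share $_ and its pos(), each using \G with /gc so that a failed match leaves the position untouched. The simplified $RXstr, the `choice` keyword, the @events list and the parse_text() wrapper are all assumptions made for illustration; the real grammar and routines are more involved.

```perl
use strict;
use warnings;

# Simplified stand-in for $RXstr (assumption): a bare name or a
# double-quoted string without escapes.
my $RXstr = qr/ ( [a-zA-Z0-9_]+ | " [^"]* " ) /x;

my @events;    # records what was recognised, in order

# Hypothetical sub-parser: carries on in $_ from the pos() left by the
# caller's successful match and consumes the body up to the closing brace.
sub parse_menu {
    my ($name) = @_;
    while (1) {
        if (m/\G \s* choice \s* \( \s* $RXstr \s* \) /xgc) {
            push @events, "menu $name: choice $1";
        }
        elsif (m/\G \s* \} /xgc) {
            return;
        }
        else {
            die "Syntax error in menu $name at offset " . pos($_) . "\n";
        }
    }
}

# Hypothetical top-level dispatcher, mirroring the if/elsif chain above.
sub parse_text {
    local $_ = shift;
    while (1) {
        if (m/\G \s* menu \s* \( \s* $RXstr \s* \) \s* \{ /xgc) {
            my $name = $1;    # copy $1 before any further match
            parse_menu($name);
        }
        elsif (m/\G \s* driver \s* \( \s* $RXstr \s* \) /xgc) {
            my $name = $1;
            push @events, "driver $name";
        }
        elsif (m/\G \s* \z /xgc) {
            last;             # only whitespace left: done
        }
        else {
            die "Syntax error at offset " . (pos($_) // 0) . "\n";
        }
    }
}

my $text = <<'END';
menu(menuYesNo) {
    choice("NO")
    choice("YES")
}
driver(drvAsyn)
END

parse_text($text);
print "$_\n" for @events;
```

The /c modifier is what makes the if/elsif chain work: without it, the first failed alternative would reset pos() and the next \G would start again from the beginning of the string.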

This works fine and performs well on Perl versions before 5.20. Here are timings for the same program under three versions of Perl; they were measured on macOS, but the regression has also been reported on Debian and Ubuntu:

woz$ perlbrew use 5.18.0
woz$ time perl -CSD registerRecordDeviceDriver.pl softIoc.dbd

real    0m0.461s
user    0m0.380s
sys     0m0.020s
woz$ perlbrew use 5.20.0
woz$ time perl -CSD registerRecordDeviceDriver.pl softIoc.dbd

real    0m14.656s
user    0m13.548s
sys     0m0.075s
woz$ perlbrew use 5.24.1
woz$ time perl -CSD registerRecordDeviceDriver.pl softIoc.dbd

real    0m9.518s
user    0m8.977s
sys     0m0.044s

I have profiled the code with NYTProf, and the additional time in the later Perl versions is all attributable to Parser::CORE:match (opcode). NYTProf counts 99062 calls to that opcode for this particular 406KB input file, spread across 9 separate routines in the parser.
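
A reduced timing harness along the following lines could help narrow this down by running the same \G /gc capture loop under each perlbrew version. It is only a sketch under stated assumptions: the input is synthetic (one repeated `driver(...)` line, about 400KB), $RXstr is a simplified stand-in, count_drivers() is a hypothetical helper, and whether this reduced case shows the same slowdown as the real parser is untested.

```perl
use strict;
use warnings;
use Time::HiRes ();

# Synthetic input (assumption): 22,000 copies of a 19-character line,
# about 400KB in total. The real softIoc.dbd is far more varied.
my $text = "driver(drvExample)\n" x 22_000;

# Simplified stand-in for $RXstr (assumption): a bare name only.
my $RXstr = qr/ ( [a-zA-Z0-9_]+ ) /x;

# Hypothetical helper: counts successful \G matches with a capture,
# the same kind of match the profiler attributes the time to.
sub count_drivers {
    local $_ = shift;
    my $count = 0;
    $count++ while m/\G \s* driver \s* \( \s* $RXstr \s* \) /xgc;
    return $count;
}

my $start   = Time::HiRes::time();
my $matches = count_drivers($text);
my $elapsed = Time::HiRes::time() - $start;

printf "perl %vd: %d matches over %d characters in %.3f s\n",
    $^V, $matches, length($text), $elapsed;
```

Note that the timings above were taken with -CSD, so the real input is decoded as UTF-8 on read, while the string built here is not; running the harness on a string read from a file with the same switches would be a closer comparison.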

This is obviously a bad regression.

Can anyone advise me how to modify my parser code so that it performs well on all versions of Perl? Other programmers on this project would love to replace the Perl code with Python, which I really don't think we should need to do, but a performance regression of this size is a problem.

Thanks for any advice...

In reply to Parser Performance Question by songmaster
