I have a Perl parser for a technical language (details of which are unimportant here). The parser is handed a text file which is of the order of 400KB or larger, read in as a single scalar (which takes a fraction of a second). The parser puts the string into $_ and then uses a series of constructs like those below:

if (m/\G menu \s* \( \s* $RXstr \s* \) \s* \{/oxgc) {
    my $name = $1;
    parse_menu($name);
}
elsif (m/\G driver \s* \( \s* $RXstr \s* \)/oxgc) {
    my $name = $1;
    parse_driver($name);
}

The $RXstr used above is defined as:

our $RXname  = qr/ [a-zA-Z0-9_\-:.\[\]<>;]+ /x;
our $RXhex   = qr/ (?: 0 [xX] [0-9A-Fa-f]+ ) /x;
our $RXoct   = qr/ 0 [0-7]* /x;
our $RXuint  = qr/ [0-9]+ /x;
our $RXint   = qr/ -? $RXuint /ox;
our $RXuintx = qr/ ( $RXhex | $RXoct | $RXuint ) /ox;
our $RXintx  = qr/ ( $RXhex | $RXoct | $RXint ) /ox;
our $RXnum   = qr/ -? (?: [0-9]+ | [0-9]* \. [0-9]+ ) (?: [eE] [-+]? [ +0-9]+ )? /x;
our $RXdqs   = qr/ " (?: [^"] | \\" )* " /x;
our $RXstr   = qr/ ( $RXname | $RXnum | $RXdqs ) /ox;

The individual parse_menu() and parse_driver() routines called in the first code segment above continue parsing from where the previous match succeeded using similar constructs.
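
For anyone who wants something runnable to experiment with, here is a minimal sketch of that structure: a top-level loop and a sub-parser sharing the match position on $_ through \G and the /gc flags. The grammar, the $RXword pattern, and the names parse_text() and parse_menu_body() are hypothetical stand-ins, not the real parser.

```perl
use strict;
use warnings;

# Hypothetical token pattern standing in for $RXstr from the post.
my $RXword = qr/ ( [A-Za-z0-9_]+ ) /x;

# Sub-parser: continues from pos($_) where the caller's match ended.
sub parse_menu_body {
    my ($name) = @_;
    my @choices;
    while (1) {
        if (m/\G \s* choice \s* \( \s* $RXword \s* \) /xgc) {
            push @choices, $1;
        }
        elsif (m/\G \s* \} /xgc) {
            return { name => $name, choices => \@choices };
        }
        else {
            # /c keeps pos($_) intact after a failed match.
            die "Syntax error in menu '$name' at offset " . pos($_) . "\n";
        }
    }
}

# Top-level parser: holds the whole input in $_ as a single scalar.
sub parse_text {
    local $_ = shift;
    pos($_) = 0;
    my @menus;
    while (1) {
        if (m/\G \s* menu \s* \( \s* $RXword \s* \) \s* \{ /xgc) {
            my $name = $1;    # copy the capture before matching again
            push @menus, parse_menu_body($name);
        }
        elsif (m/\G \s* \z /xgc) {
            last;             # only trailing whitespace left
        }
        else {
            die "Syntax error at offset " . pos($_) . "\n";
        }
    }
    return \@menus;
}

my $menus = parse_text(
    "menu(colors) { choice(red) choice(green) }\nmenu(sizes) { choice(big) }\n"
);
print "$_->{name}: @{ $_->{choices} }\n" for @$menus;
```

Every match in this sketch is anchored with \G and uses a capture group, which is the same shape as the constructs in the real parser.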

This works fine and performs well on Perl versions before 5.20. Here are the results of running the program under three different versions of Perl. They were measured on macOS, but the regression has also been reported on Debian and Ubuntu:

woz$ perlbrew use 5.18.0
woz$ time perl -CSD registerRecordDeviceDriver.pl softIoc.dbd

real 0m0.461s
user 0m0.380s
sys  0m0.020s

woz$ perlbrew use 5.20.0
woz$ time perl -CSD registerRecordDeviceDriver.pl softIoc.dbd

real 0m14.656s
user 0m13.548s
sys  0m0.075s

woz$ perlbrew use 5.24.1
woz$ time perl -CSD registerRecordDeviceDriver.pl softIoc.dbd

real 0m9.518s
user 0m8.977s
sys  0m0.044s

Using NYTProf I have profiled the code, and the additional time in the later Perl versions is all attributable to Parser::CORE:match (opcode). NYTProf counts 99062 calls to that opcode during the run for this particular 406KB input file, spread across 9 separate routines in the parser.
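
To compare Perl versions without the full parser, a small timing harness along these lines might help. It is only a sketch under assumptions: the input is synthetic, count_drivers() and the record format are hypothetical, and I have not established that it reproduces the slowdown. Raising $records and running it under each perlbrew version would show whether the per-match cost grows with the length of the string.

```perl
use strict;
use warnings;
use Time::HiRes qw(time);

# Scan the whole string with \G and /gc, capturing a name on every match,
# in the same style as the parser constructs shown earlier.
sub count_drivers {
    my ($text) = @_;
    my $count = 0;
    for ($text) {    # alias $_ to $text for the duration of the scan
        pos($_) = 0;
        while (m/\G \s* driver \s* \( \s* ( [A-Za-z0-9_]+ ) \s* \) /xgc) {
            my $name = $1;    # force use of the capture, as the parser does
            $count++;
        }
    }
    return $count;
}

# Synthetic input (an assumption): one "driver(...)" record per line.
my $records = 5000;
my $text    = join '', map "driver(drv$_)\n", 1 .. $records;

my $start   = time;
my $found   = count_drivers($text);
my $elapsed = time - $start;

printf "perl %vd: %d matches over %d bytes in %.3fs\n",
    $^V, $found, length($text), $elapsed;
```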

This is obviously a bad regression.

Can anyone advise me how to modify my parser code so it performs well on all versions of Perl? There are other programmers on this project who would love to replace the Perl code with Python, which I really don't think we should need to do, but this level of a performance regression is a problem.

Thanks for any advice...


In reply to Parser Performance Question by songmaster
