The original case is a parser like this:

    sub parseAny {
        #my $p = shift; # pkg or doc
        my $c = shift;
        #my $objnum = shift;
        #my $gennum = shift;

        return ${$c} =~ m/ \G \d+\s+\d+\s+R\b /xms ? 'parseRef(      $c, $objnum, $gennum)'
             : ${$c} =~ m{ \G / }xms               ? 'parseLabel(    $c, $objnum, $gennum)'
             : ${$c} =~ m/ \G << /xms              ? 'parseDict(     $c, $objnum, $gennum)'
             : ${$c} =~ m/ \G \[ /xms              ? 'parseArray(    $c, $objnum, $gennum)'
             : ${$c} =~ m/ \G [(] /xms             ? 'parseString(   $c, $objnum, $gennum)'
             : ${$c} =~ m/ \G < /xms               ? 'parseHexString($c, $objnum, $gennum)'
             : ${$c} =~ m/ \G [\d.+-]+ /xms        ? 'parseNum(      $c, $objnum, $gennum)'
             : ${$c} =~ m/ \G (true|false) /ixms   ? 'parseBoolean(  $c, $objnum, $gennum)'
             : ${$c} =~ m/ \G null /ixms           ? 'parseNull(     $c, $objnum, $gennum)'
             : die "Unrecognized type in parseAny\n";
    }

I think the best approach would be a tokenizer that takes the first character and decides from that what to do. This would mean rewriting the regex into something really unreadable like:

    sub parseAny_token {
        #my $p = shift; # pkg or doc
        my $c = shift;
        #my $objnum = shift;
        #my $gennum = shift;

        # Match against ${$c} with /gc so that pos() advances past the token
        # and the follow-up \G matches continue after it.
        ${$c} =~ m{ \G (?: ([0-9]+) | (/) | (<<) | (\[) | ([.+-]) | (true|false) | (null) ) }xmsigc
            or die "Unrecognized type in parseAny\n";

        # now dispatch based on $1 etc:
        if( defined $1 ) {
            my $num = $1;
            if( ${$c} =~ m/\G\s+\d+\s+R\b/gc ) {
                # handle "$num $gen R": parseRef( $c, $objnum, $gennum)
            } elsif( ${$c} =~ m/\G([-+.\d]+)/gc ) {
                # handle "$num$1": parseNum( $c, $objnum, $gennum)
            } else {
                # handle "$num"
            };
        } elsif( defined $2 ) {
            # "/": parseLabel( $c, $objnum, $gennum)
        }
        ...
    }
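For comparison, here is a minimal sketch of the "look at the first character" idea without one big alternation. It is not code from CAM::PDF: `classify_token` and `%by_first_char` are names made up for this example, and like the benchmark version above it only returns the name of the parser instead of calling it. Only a leading digit needs the more expensive lookahead for an indirect reference.

```perl
use strict;
use warnings;

# Hypothetical lookup table: first character => name of the parser to call.
my %by_first_char = (
    '/' => 'parseLabel',
    '[' => 'parseArray',
    '(' => 'parseString',
    ( map { $_ => 'parseNum' } '+', '-', '.' ),
);

# Returns the parser name for the token starting at pos() of ${$c}.
# pos() is left untouched because none of the matches use /g.
sub classify_token {
    my ($c) = @_;
    my $pos = pos( ${$c} ) // 0;
    my $ch  = substr( ${$c}, $pos, 1 );

    return $by_first_char{$ch} if exists $by_first_char{$ch};

    if ( $ch eq '<' ) {
        # "<<" opens a dictionary, a single "<" a hex string
        return substr( ${$c}, $pos, 2 ) eq '<<' ? 'parseDict' : 'parseHexString';
    }
    if ( $ch =~ /[0-9]/ ) {
        # "objnum gennum R" is a reference, anything else a plain number
        return ${$c} =~ m/ \G \d+ \s+ \d+ \s+ R \b /xms ? 'parseRef' : 'parseNum';
    }
    return 'parseBoolean' if ${$c} =~ m/ \G (?: true | false ) /xmsi;
    return 'parseNull'    if ${$c} =~ m/ \G null /xmsi;

    die "Unrecognized type in parseAny\n";
}

my $doc = '12 0 R';
print classify_token( \$doc ), "\n";
```

Whether this is actually faster than the chained ternary depends on the input mix, so it would need to be measured against real PDF data.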

That would need a lot of good unit tests to make sure the grammar rewrite still works and especially still picks up parsing at the right places when something like a +R comes in the input stream.
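To give an idea of what such tests could look like, here is a small sketch using Test::More. The `next_token` routine is only a hypothetical stand-in that knows references, numbers and names; the real rewritten parser would go in its place. The interesting cases are the ones where a reference almost matches and the tokenizer has to resume at the right position.

```perl
use strict;
use warnings;
use Test::More;

# Hypothetical stand-in for the rewritten tokenizer: consumes one token at
# pos() of ${$c} and returns [ type, text ], or nothing if no token matches.
sub next_token {
    my ($c) = @_;
    ${$c} =~ m/ \G \s+ /xmsgc;    # skip leading whitespace
    return [ ref  => $1 ] if ${$c} =~ m/ \G ( \d+ \s+ \d+ \s+ R ) \b /xmsgc;
    return [ num  => $1 ] if ${$c} =~ m/ \G ( [+-]? (?: \d+ \.? \d* | \. \d+ ) ) /xmsgc;
    return [ name => $1 ] if ${$c} =~ m{ \G ( / [^\s/\[\]<>()]* ) }xmsgc;
    return;
}

# Tokenize a whole string into a list of "type:text" strings.
sub tokens {
    my ($input) = @_;
    my @out;
    while ( my $tok = next_token( \$input ) ) {
        push @out, join ':', @{$tok};
    }
    return \@out;
}

is_deeply( tokens('12 0 R'), ['ref:12 0 R'], 'plain indirect reference' );
is_deeply( tokens('1 2 3 0 R'), [ 'num:1', 'num:2', 'ref:3 0 R' ],
    'numbers before a reference are not swallowed' );
is_deeply( tokens('12 0 /R'), [ 'num:12', 'num:0', 'name:/R' ],
    'a name starting with R is not a reference' );
is_deeply( tokens('12 0 Rx'), [ 'num:12', 'num:0' ],
    'R followed by a word character is not a reference' );
is_deeply( tokens('+.5 -3'), [ 'num:+.5', 'num:-3' ], 'signed numbers' );

done_testing();
```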

In my toy implementation of the tokenizer, I get 90% of the performance of the original R case for both cases. Maybe it would be worth sharing your problematic PDF with the author of CAM::PDF (or me) if you can, just to see whether it can be turned into a good test case...
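A comparison along these lines can be set up with the core Benchmark module. The following is only a sketch with shortened, made-up versions of both approaches (`by_chain` and `by_token` are names invented here, not the code I actually timed), and the numbers will vary with the Perl version and the input.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

my $input = '12 0 R';

# Chained ternary in the style of the original parseAny (shortened).
sub by_chain {
    my ($c) = @_;
    return ${$c} =~ m/ \G \d+\s+\d+\s+R\b /xms ? 'parseRef'
         : ${$c} =~ m{ \G / }xms               ? 'parseLabel'
         : ${$c} =~ m/ \G [\d.+-]+ /xms        ? 'parseNum'
         : die "Unrecognized type\n";
}

# Single alternation, then dispatch on which capture group matched.
sub by_token {
    my ($c) = @_;
    ${$c} =~ m{ \G (?: (\d+\s+\d+\s+R\b) | (/) | ([\d.+-]+) ) }xms
        or die "Unrecognized type\n";
    return defined $1 ? 'parseRef'
         : defined $2 ? 'parseLabel'
         :              'parseNum';
}

# Run each variant for at least one CPU second and print a comparison table.
cmpthese( -1, {
    chain => sub { by_chain( \$input ) },
    token => sub { by_token( \$input ) },
} );
```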


In reply to Re^2: Why is Perl suddenly slow in THIS case? by Corion
in thread Why is Perl suddenly slow in THIS case? by vr
