vr has asked for the wisdom of the Perl Monks concerning the following question:

use strict; use warnings; use Benchmark qw/ cmpthese /; my $R = 'R' x 42_000; my $Q = 'Q' x 42_000; sub foo { my $c = shift; pos $$c = 0; while ( 1 ) { next if $$c =~ /\G\d+R/; last unless $$c =~ /\w/g; } } cmpthese -3, { R => sub { foo( \$R )}, Q => sub { foo( \$Q )}, }

C:\>perl test.pl Rate Q R Q 1.30/s -- -96% R 30.9/s 2279% --

I'd expect both strings should behave the same (match fails immediately) in line which starts with next, but it's obviously not so. There's some weird internal "optimization" going on here, but why?

And it's not just idle curiosity, dear monks. Look here. Syntax like 42 0 R is "indirect object" ("R" for "reference"), i.e. "R" is kind of "special". It so happens that huge (150 Mb) page content has no "R"'s, and it takes forever to parse this string :(.

Replies are listed 'Best First'.
Re: Why is Perl suddenly slow in THIS case?
by Corion (Patriarch) on Mar 05, 2017 at 19:28 UTC

    Perl is doing much more (but smarter) work when it finds a R in the string:

    use strict; use warnings; use Benchmark qw/ cmpthese /; my $R = 'R' x 10; my $Q = 'Q' x 10; use re 'debug'; sub foo { my $c = shift; pos $$c = 0; while ( 1 ) { next if $$c =~ /\G\d+R/; last unless $$c =~ /\w/g; } } print "-" x 80, "\n"; foo( \$Q ); print "-" x 80, "\n"; foo( \$R ); print "-" x 80, "\n";
    Matching REx "\G\d+R" against "QQQQQQQQQQ" Intuit: trying to determine minimum start position... Did not find floating substr "R"... Match rejected by optimizer Matching REx "\w" against "QQQQQQQQQQ" Matching stclass POSIXD[\w] against "QQQQQQQQQQ" (10 bytes) 0 <> <QQQQQQQQQQ> | 1:POSIXD[\w](2) 1 <Q> <QQQQQQQQQ> | 2:END(0) Match successful! ... Matching REx "\G\d+R" against "QQQQQQQQQQ" Regex match can't succeed, so not even tried Matching REx "\w" against "Q" Matching stclass POSIXD[\w] against "Q" (1 bytes) 9 <QQQQQQQQQ> <Q> | 1:POSIXD[\w](2) 10 <QQQQQQQQQQ> <> | 2:END(0) Match successful! Matching REx "\G\d+R" against "QQQQQQQQQQ" Regex match can't succeed, so not even tried Matching REx "\w" against "" Regex match can't succeed, so not even tried ---------------------------------------------------- ---------------------------------------------------------------------- +---------- Matching REx "\G\d+R" against "RRRRRRRRRR" Intuit: trying to determine minimum start position... Found floating substr "R" at offset 1... (multiline anchor test skipped) looking for class: start_shift: 1 check_at: 1 rx_origin: 0 endpos: 1 This position contradicts STCLASS... Match rejected by optimizer Matching REx "\w" against "RRRRRRRRRR" Matching stclass POSIXD[\w] against "RRRRRRRRRR" (10 bytes) 0 <> <RRRRRRRRRR> | 1:POSIXD[\w](2) 1 <R> <RRRRRRRRR> | 2:END(0) Match successful! Matching REx "\G\d+R" against "RRRRRRRRRR" Intuit: trying to determine minimum start position... Found floating substr "R" at offset 1... (multiline anchor test skipped) looking for class: start_shift: 1 check_at: 2 rx_origin: 1 endpos: 2 This position contradicts STCLASS... Match rejected by optimizer Matching REx "\w" against "RRRRRRRRR" Matching stclass POSIXD[\w] against "RRRRRRRRR" (9 bytes) 1 <R> <RRRRRRRRR> | 1:POSIXD[\w](2) 2 <RR> <RRRRRRRR> | 2:END(0) Match successful! ...

    I'm not sure how you could make Perl try not to scan the string for the fixed substring R but immediately look for \G\d to immediately reject the parse at that location, no matter what follows.

    By splitting up the parsing between \d+ and R, I can make both cases behave slightly 18% worse than \G\d+R, but I'm not sure if that's correct:

    use strict; use warnings; use Benchmark qw/ cmpthese /; my $R = 'R' x 42_000; my $Q = 'Q' x 42_000; sub foo { my $c = shift; pos $$c = 0; while ( 1 ) { next if $$c =~ /\G\d+R/; last unless $$c =~ /\w/g; } } sub foo_twostep { my $c = shift; pos $$c = 0; while ( 1 ) { next if ($$c =~ /\G[0-9]+/ and $$c =~ /\GR/); last unless $$c =~ /\w/g; } } cmpthese -3, { R => sub { foo( \$R )}, Q => sub { foo( \$Q )}, R_twostep => sub { foo_twostep( \$R )}, Q_twostep => sub { foo_twostep( \$Q )}, } __END__ Rate Q R_twostep Q_twostep R Q 2.01/s -- -97% -97% -97% R_twostep 65.3/s 3147% -- 0% -15% Q_twostep 65.3/s 3147% 0% -- -15% R 77.1/s 3732% 18% 18% --

    Update: Using only ASCII digits, I can get a slowdown of 12%:

    use strict; use warnings; use Benchmark qw/ cmpthese /; my $R = 'R' x 42_000; my $Q = 'Q' x 42_000; sub foo { my $c = shift; pos $$c = 0; while ( 1 ) { next if $$c =~ /\G\d+R/; last unless $$c =~ /\w/g; } } sub foo_twostep { my $c = shift; pos $$c = 0; while ( 1 ) { next if ($$c =~ /\G\d+/ and $$c =~ /\GR/); last unless $$c =~ /\w/g; } } sub foo_asciidigits { my $c = shift; pos $$c = 0; while ( 1 ) { next if ($$c =~ /\G[0-9]/ and $$c =~ /\GR/); last unless $$c =~ /\w/g; } } cmpthese -3, { R => sub { foo( \$R )}, Q => sub { foo( \$Q )}, R_twostep => sub { foo_twostep( \$R )}, Q_twostep => sub { foo_twostep( \$Q )}, R_ascii => sub { foo_asciidigits( \$R )}, Q_ascii => sub { foo_asciidigits( \$Q )}, } __END__ Rate Q Q_twostep R_twostep R_ascii Q_ascii + R Q 1.99/s -- -97% -97% -97% -97% + -97% Q_twostep 62.9/s 3054% -- -1% -7% -8% + -17% R_twostep 63.8/s 3101% 1% -- -5% -6% + -16% R_ascii 67.5/s 3287% 7% 6% -- -1% + -11% Q_ascii 68.1/s 3319% 8% 7% 1% -- + -10% R 76.0/s 3715% 21% 19% 13% 12% + --

      I just realized that my two-step approach is not the same, as it advances \G whenever it would find \d+ even if not followed by R. That may or may not be desireable in the more general PDF parsing code.

        It makes sense now, with your explanation, thank you.

        Intuit: trying to determine minimum start position...

        But isn't this "start position" already determined by \G? :)

        In my case it would be easy. Indirect objects can't happen in content streams, so modified (without first check) parseAny_in_stream can be added. It then is called as class method in CAM::PDF::Content. On the other hand, the 150 Mb "R"-less streams are probably too rare to justify any trouble making this fix.

      As the original case is some parser like this:

      sub parseAny { #my $p = shift; # pkg or doc my $c = shift; #my $objnum = shift; #my $gennum = shift; return ${$c} =~ m/ \G \d+\s+\d+\s+R\b /xms ? 'parseRef( $c, $ +objnum, $gennum)' : ${$c} =~ m{ \G / }xms ? 'parseLabel( $c, $ +objnum, $gennum)' : ${$c} =~ m/ \G << /xms ? 'parseDict( $c, $ +objnum, $gennum)' : ${$c} =~ m/ \G \[ /xms ? 'parseArray( $c, $ +objnum, $gennum)' : ${$c} =~ m/ \G [(] /xms ? 'parseString( $c, $ +objnum, $gennum)' : ${$c} =~ m/ \G < /xms ? 'parseHexString($c, $ +objnum, $gennum)' : ${$c} =~ m/ \G [\d.+-]+ /xms ? 'parseNum( $c, $ +objnum, $gennum)' : ${$c} =~ m/ \G (true|false) /ixms ? 'parseBoolean( $c, $ +objnum, $gennum)' : ${$c} =~ m/ \G null /ixms ? 'parseNull( $c, $ +objnum, $gennum)' : die "Unrecognized type in parseAny\n"; }

      I think the best approach would be a tokenizer that takes the first character and decides from that what to do. This would mean rewriting the regex into something really unreadable like:

      sub parseAny_token { #my $p = shift; # pkg or doc my $c = shift; #my $objnum = shift; #my $gennum = shift; my $ch = m{ \G (?: ([0-9]+) |(/) |(<<) |(\[) |\([.+-]) |(true|false) |(null) }xmsi } or die "Unrecognized type in parseAny\n"; # now dispatch based on $1 etc: if( defined $1 ) { my $num = $1; if( m/\G\s+\d+\s+R\b/ ) { # Handle $num $num $ parseRef( $c, $objnum, $gennum) } elsif( m/\G([-+.\d+])/ ) { # handle "$num$1" "$num$1 parseNum( $c, $objnum, $gennum) } else { # handle "$num" }; } elsif( defined $2 ) { # / parseLabel( $c, $objnum, $gennum) } ... }

      That would need a lot of good unit tests to make sure the grammar rewrite still works and especially still picks up parsing at the right places when something like a +R comes in the input stream.

      In my toy implementation for the tokenizer, I get 90% of the performance of the original R case for both cases. Maybe it would be worth to share your problematic PDF with the author of CAM::PDF (or me) if you can, just to see whether it can be turned into a good test case...

        A particular file was chosen as extreme edge case. Huge PDFs are common, but their size is mostly because of images. Here it's "pure vector graphics", and I experimented with how soon CAM::PDF will fail because of "out of memory" when building "page content tree" (as opposed to "just open and parse objects and report this or that").

        I.e. with this edge case I'm checking if it's practical to "build tree" considering speed and memory requirements, and if some improvements (still just ideas) can make it use less memory. And with "R"-less stream I couldn't check that because it would take hours and I didn't understand what's going on. I mean, to test modified parser, maybe ordinary distributed test files are OK.

Re: Why is Perl suddenly slow in THIS case?
by BrowserUk (Patriarch) on Mar 05, 2017 at 19:46 UTC

    Ignore this: I misread (inverted) the meaning of the results.

    <

    I get similar results to you using 5.22:

    C:\test>\Perl22\bin\perl.exe 1183703.pl Rate Q R Q 1.29/s -- -96% R 34.9/s 2595% --

    And more consistent, but slower results using 5.10:

    C:\test>1183703.pl Rate R Q R 46.0/s -- -1% Q 46.7/s 1% --

    My conclusion is that some recent optimisation helps enormously with your Q example, and somewhat, but less so with your R example.

    I hazard a guess that the optimisation is a fast linear search for the presence of any constant bits of the regex means the Q example is very fast -- the R constant isn't there.

    But for the R example, the constant 'R' is present, so the regex engine needs to be fired up, and it probably checks for the presence of one or more digits preceding every occurrence of 'R' in the string.

    Other than adding another optimisation to detect that the string doesn't contain any digits -- a difficult thing to generalise -- you should probably be rejoicing in the optimisation that speeds up the Q case, rather than complaining that it doesn't help in the R case.


    With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority". The enemy of (IT) success is complexity.
    In the absence of evidence, opinion is indistinguishable from prejudice.
Re: Why is Perl suddenly slow in THIS case?
by hdb (Monsignor) on Mar 06, 2017 at 11:15 UTC

    On Strawberry, 5.16.3.1, 64bit, I do not get funny stuff:

    Rate Q R Q 46.0/s -- -3% R 47.2/s 3% --
Re: Why is Perl suddenly slow in THIS case?
by huck (Prior) on Mar 05, 2017 at 19:27 UTC

    \d+R specifies 0 or more digits followed by an R, so when the string is 42,000 R's it matches, triggering next, and the loop continues. Since there are 42,000 R's next is signaled 42,000 times in the loop before it runs out of characters to test. When the string is 42,000 Q's that first test does not match, it falls thru, and since there is no whitespace the loop terminates

    ok, nevermind, i didnt get anypart right did i, \d+ is 1 or more, and \w is a word char not whitespace.