in reply to Why is Perl suddenly slow in THIS case?

Perl is doing much more (but smarter) work when it finds a R in the string:

use strict; use warnings; use Benchmark qw/ cmpthese /; my $R = 'R' x 10; my $Q = 'Q' x 10; use re 'debug'; sub foo { my $c = shift; pos $$c = 0; while ( 1 ) { next if $$c =~ /\G\d+R/; last unless $$c =~ /\w/g; } } print "-" x 80, "\n"; foo( \$Q ); print "-" x 80, "\n"; foo( \$R ); print "-" x 80, "\n";
Matching REx "\G\d+R" against "QQQQQQQQQQ" Intuit: trying to determine minimum start position... Did not find floating substr "R"... Match rejected by optimizer Matching REx "\w" against "QQQQQQQQQQ" Matching stclass POSIXD[\w] against "QQQQQQQQQQ" (10 bytes) 0 <> <QQQQQQQQQQ> | 1:POSIXD[\w](2) 1 <Q> <QQQQQQQQQ> | 2:END(0) Match successful! ... Matching REx "\G\d+R" against "QQQQQQQQQQ" Regex match can't succeed, so not even tried Matching REx "\w" against "Q" Matching stclass POSIXD[\w] against "Q" (1 bytes) 9 <QQQQQQQQQ> <Q> | 1:POSIXD[\w](2) 10 <QQQQQQQQQQ> <> | 2:END(0) Match successful! Matching REx "\G\d+R" against "QQQQQQQQQQ" Regex match can't succeed, so not even tried Matching REx "\w" against "" Regex match can't succeed, so not even tried ---------------------------------------------------- ---------------------------------------------------------------------- +---------- Matching REx "\G\d+R" against "RRRRRRRRRR" Intuit: trying to determine minimum start position... Found floating substr "R" at offset 1... (multiline anchor test skipped) looking for class: start_shift: 1 check_at: 1 rx_origin: 0 endpos: 1 This position contradicts STCLASS... Match rejected by optimizer Matching REx "\w" against "RRRRRRRRRR" Matching stclass POSIXD[\w] against "RRRRRRRRRR" (10 bytes) 0 <> <RRRRRRRRRR> | 1:POSIXD[\w](2) 1 <R> <RRRRRRRRR> | 2:END(0) Match successful! Matching REx "\G\d+R" against "RRRRRRRRRR" Intuit: trying to determine minimum start position... Found floating substr "R" at offset 1... (multiline anchor test skipped) looking for class: start_shift: 1 check_at: 2 rx_origin: 1 endpos: 2 This position contradicts STCLASS... Match rejected by optimizer Matching REx "\w" against "RRRRRRRRR" Matching stclass POSIXD[\w] against "RRRRRRRRR" (9 bytes) 1 <R> <RRRRRRRRR> | 1:POSIXD[\w](2) 2 <RR> <RRRRRRRR> | 2:END(0) Match successful! ...

I'm not sure how you could make Perl try not to scan the string for the fixed substring R but immediately look for \G\d to immediately reject the parse at that location, no matter what follows.

By splitting up the parsing between \d+ and R, I can make both cases behave slightly 18% worse than \G\d+R, but I'm not sure if that's correct:

use strict; use warnings; use Benchmark qw/ cmpthese /; my $R = 'R' x 42_000; my $Q = 'Q' x 42_000; sub foo { my $c = shift; pos $$c = 0; while ( 1 ) { next if $$c =~ /\G\d+R/; last unless $$c =~ /\w/g; } } sub foo_twostep { my $c = shift; pos $$c = 0; while ( 1 ) { next if ($$c =~ /\G[0-9]+/ and $$c =~ /\GR/); last unless $$c =~ /\w/g; } } cmpthese -3, { R => sub { foo( \$R )}, Q => sub { foo( \$Q )}, R_twostep => sub { foo_twostep( \$R )}, Q_twostep => sub { foo_twostep( \$Q )}, } __END__ Rate Q R_twostep Q_twostep R Q 2.01/s -- -97% -97% -97% R_twostep 65.3/s 3147% -- 0% -15% Q_twostep 65.3/s 3147% 0% -- -15% R 77.1/s 3732% 18% 18% --

Update: Using only ASCII digits, I can get a slowdown of 12%:

use strict; use warnings; use Benchmark qw/ cmpthese /; my $R = 'R' x 42_000; my $Q = 'Q' x 42_000; sub foo { my $c = shift; pos $$c = 0; while ( 1 ) { next if $$c =~ /\G\d+R/; last unless $$c =~ /\w/g; } } sub foo_twostep { my $c = shift; pos $$c = 0; while ( 1 ) { next if ($$c =~ /\G\d+/ and $$c =~ /\GR/); last unless $$c =~ /\w/g; } } sub foo_asciidigits { my $c = shift; pos $$c = 0; while ( 1 ) { next if ($$c =~ /\G[0-9]/ and $$c =~ /\GR/); last unless $$c =~ /\w/g; } } cmpthese -3, { R => sub { foo( \$R )}, Q => sub { foo( \$Q )}, R_twostep => sub { foo_twostep( \$R )}, Q_twostep => sub { foo_twostep( \$Q )}, R_ascii => sub { foo_asciidigits( \$R )}, Q_ascii => sub { foo_asciidigits( \$Q )}, } __END__ Rate Q Q_twostep R_twostep R_ascii Q_ascii + R Q 1.99/s -- -97% -97% -97% -97% + -97% Q_twostep 62.9/s 3054% -- -1% -7% -8% + -17% R_twostep 63.8/s 3101% 1% -- -5% -6% + -16% R_ascii 67.5/s 3287% 7% 6% -- -1% + -11% Q_ascii 68.1/s 3319% 8% 7% 1% -- + -10% R 76.0/s 3715% 21% 19% 13% 12% + --

Replies are listed 'Best First'.
Re^2: Why is Perl suddenly slow in THIS case?
by Corion (Patriarch) on Mar 05, 2017 at 19:40 UTC

    I just realized that my two-step approach is not the same, as it advances \G whenever it would find \d+ even if not followed by R. That may or may not be desireable in the more general PDF parsing code.

      It makes sense now, with your explanation, thank you.

      Intuit: trying to determine minimum start position...

      But isn't this "start position" already determined by \G? :)

      In my case it would be easy. Indirect objects can't happen in content streams, so modified (without first check) parseAny_in_stream can be added. It then is called as class method in CAM::PDF::Content. On the other hand, the 150 Mb "R"-less streams are probably too rare to justify any trouble making this fix.

Re^2: Why is Perl suddenly slow in THIS case?
by Corion (Patriarch) on Mar 06, 2017 at 11:47 UTC

    As the original case is some parser like this:

    sub parseAny { #my $p = shift; # pkg or doc my $c = shift; #my $objnum = shift; #my $gennum = shift; return ${$c} =~ m/ \G \d+\s+\d+\s+R\b /xms ? 'parseRef( $c, $ +objnum, $gennum)' : ${$c} =~ m{ \G / }xms ? 'parseLabel( $c, $ +objnum, $gennum)' : ${$c} =~ m/ \G << /xms ? 'parseDict( $c, $ +objnum, $gennum)' : ${$c} =~ m/ \G \[ /xms ? 'parseArray( $c, $ +objnum, $gennum)' : ${$c} =~ m/ \G [(] /xms ? 'parseString( $c, $ +objnum, $gennum)' : ${$c} =~ m/ \G < /xms ? 'parseHexString($c, $ +objnum, $gennum)' : ${$c} =~ m/ \G [\d.+-]+ /xms ? 'parseNum( $c, $ +objnum, $gennum)' : ${$c} =~ m/ \G (true|false) /ixms ? 'parseBoolean( $c, $ +objnum, $gennum)' : ${$c} =~ m/ \G null /ixms ? 'parseNull( $c, $ +objnum, $gennum)' : die "Unrecognized type in parseAny\n"; }

    I think the best approach would be a tokenizer that takes the first character and decides from that what to do. This would mean rewriting the regex into something really unreadable like:

    sub parseAny_token { #my $p = shift; # pkg or doc my $c = shift; #my $objnum = shift; #my $gennum = shift; my $ch = m{ \G (?: ([0-9]+) |(/) |(<<) |(\[) |\([.+-]) |(true|false) |(null) }xmsi } or die "Unrecognized type in parseAny\n"; # now dispatch based on $1 etc: if( defined $1 ) { my $num = $1; if( m/\G\s+\d+\s+R\b/ ) { # Handle $num $num $ parseRef( $c, $objnum, $gennum) } elsif( m/\G([-+.\d+])/ ) { # handle "$num$1" "$num$1 parseNum( $c, $objnum, $gennum) } else { # handle "$num" }; } elsif( defined $2 ) { # / parseLabel( $c, $objnum, $gennum) } ... }

    That would need a lot of good unit tests to make sure the grammar rewrite still works and especially still picks up parsing at the right places when something like a +R comes in the input stream.

    In my toy implementation for the tokenizer, I get 90% of the performance of the original R case for both cases. Maybe it would be worth to share your problematic PDF with the author of CAM::PDF (or me) if you can, just to see whether it can be turned into a good test case...

      A particular file was chosen as extreme edge case. Huge PDFs are common, but their size is mostly because of images. Here it's "pure vector graphics", and I experimented with how soon CAM::PDF will fail because of "out of memory" when building "page content tree" (as opposed to "just open and parse objects and report this or that").

      I.e. with this edge case I'm checking if it's practical to "build tree" considering speed and memory requirements, and if some improvements (still just ideas) can make it use less memory. And with "R"-less stream I couldn't check that because it would take hours and I didn't understand what's going on. I mean, to test modified parser, maybe ordinary distributed test files are OK.