Why is Perl suddenly slow in THIS case?

vr has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
Re: Why is Perl suddenly slow in THIS case? by Corion (Patriarch) on Mar 05, 2017 at 19:28 UTC
Perl is doing much more (but smarter) work when it finds a `R` in the string: `use strict; use warnings; use Benchmark qw/ cmpthese /; my $R = 'R' x 10; my $Q = 'Q' x 10; use re 'debug'; sub foo { my $c = shift; pos $$c = 0; while ( 1 ) { next if $$c =~ /\G\d+R/; last unless $$c =~ /\w/g; } } print "-" x 80, "\n"; foo( \$Q ); print "-" x 80, "\n"; foo( \$R ); print "-" x 80, "\n";` [download] Matching REx "\G\d+R" against "QQQQQQQQQQ" Intuit: trying to determine minimum start position... Did not find floating substr "R"... Match rejected by optimizer Matching REx "\w" against "QQQQQQQQQQ" Matching stclass POSIXD[\w] against "QQQQQQQQQQ" (10 bytes) 0 <> <QQQQQQQQQQ> \| 1:POSIXD[\w](2) 1 <Q> <QQQQQQQQQ> \| 2:END(0) Match successful! ... Matching REx "\G\d+R" against "QQQQQQQQQQ" Regex match can't succeed, so not even tried Matching REx "\w" against "Q" Matching stclass POSIXD[\w] against "Q" (1 bytes) 9 <QQQQQQQQQ> <Q> \| 1:POSIXD[\w](2) 10 <QQQQQQQQQQ> <> \| 2:END(0) Match successful! Matching REx "\G\d+R" against "QQQQQQQQQQ" Regex match can't succeed, so not even tried Matching REx "\w" against "" Regex match can't succeed, so not even tried ---------------------------------------------------- ---------------------------------------------------------------------- +---------- Matching REx "\G\d+R" against "RRRRRRRRRR" Intuit: trying to determine minimum start position... Found floating substr "R" at offset 1... (multiline anchor test skipped) looking for class: start_shift: 1 check_at: 1 rx_origin: 0 endpos: 1 This position contradicts STCLASS... Match rejected by optimizer Matching REx "\w" against "RRRRRRRRRR" Matching stclass POSIXD[\w] against "RRRRRRRRRR" (10 bytes) 0 <> <RRRRRRRRRR> \| 1:POSIXD[\w](2) 1 <R> <RRRRRRRRR> \| 2:END(0) Match successful! Matching REx "\G\d+R" against "RRRRRRRRRR" Intuit: trying to determine minimum start position... Found floating substr "R" at offset 1... (multiline anchor test skipped) looking for class: start_shift: 1 check_at: 2 rx_origin: 1 endpos: 2 This position contradicts STCLASS... Match rejected by optimizer Matching REx "\w" against "RRRRRRRRR" Matching stclass POSIXD[\w] against "RRRRRRRRR" (9 bytes) 1 <R> <RRRRRRRRR> \| 1:POSIXD[\w](2) 2 <RR> <RRRRRRRR> \| 2:END(0) Match successful! ... [download] I'm not sure how you could make Perl try not to scan the string for the fixed substring `R` but immediately look for `\G\d` to immediately reject the parse at that location, no matter what follows. By splitting up the parsing between `\d+` and `R`, I can make both cases behave ~~slightly~~ 18% worse than `\G\d+R`, but I'm not sure if that's correct: use strict; use warnings; use Benchmark qw/ cmpthese /; my $R = 'R' x 42_000; my $Q = 'Q' x 42_000; sub foo { my $c = shift; pos $$c = 0; while ( 1 ) { next if $$c =~ /\G\d+R/; last unless $$c =~ /\w/g; } } sub foo_twostep { my $c = shift; pos $$c = 0; while ( 1 ) { next if ($$c =~ /\G[0-9]+/ and $$c =~ /\GR/); last unless $$c =~ /\w/g; } } cmpthese -3, { R => sub { foo( \$R )}, Q => sub { foo( \$Q )}, R_twostep => sub { foo_twostep( \$R )}, Q_twostep => sub { foo_twostep( \$Q )}, } __END__ Rate Q R_twostep Q_twostep R Q 2.01/s -- -97% -97% -97% R_twostep 65.3/s 3147% -- 0% -15% Q_twostep 65.3/s 3147% 0% -- -15% R 77.1/s 3732% 18% 18% -- [download] Update: Using only ASCII digits, I can get a slowdown of 12%: use strict; use warnings; use Benchmark qw/ cmpthese /; my $R = 'R' x 42_000; my $Q = 'Q' x 42_000; sub foo { my $c = shift; pos $$c = 0; while ( 1 ) { next if $$c =~ /\G\d+R/; last unless $$c =~ /\w/g; } } sub foo_twostep { my $c = shift; pos $$c = 0; while ( 1 ) { next if ($$c =~ /\G\d+/ and $$c =~ /\GR/); last unless $$c =~ /\w/g; } } sub foo_asciidigits { my $c = shift; pos $$c = 0; while ( 1 ) { next if ($$c =~ /\G[0-9]/ and $$c =~ /\GR/); last unless $$c =~ /\w/g; } } cmpthese -3, { R => sub { foo( \$R )}, Q => sub { foo( \$Q )}, R_twostep => sub { foo_twostep( \$R )}, Q_twostep => sub { foo_twostep( \$Q )}, R_ascii => sub { foo_asciidigits( \$R )}, Q_ascii => sub { foo_asciidigits( \$Q )}, } __END__ Rate Q Q_twostep R_twostep R_ascii Q_ascii + R Q 1.99/s -- -97% -97% -97% -97% + -97% Q_twostep 62.9/s 3054% -- -1% -7% -8% + -17% R_twostep 63.8/s 3101% 1% -- -5% -6% + -16% R_ascii 67.5/s 3287% 7% 6% -- -1% + -11% Q_ascii 68.1/s 3319% 8% 7% 1% -- + -10% R 76.0/s 3715% 21% 19% 13% 12% + -- [download]	[reply] [d/l] [select]
Re^2: Why is Perl suddenly slow in THIS case? by Corion (Patriarch) on Mar 05, 2017 at 19:40 UTC
I just realized that my two-step approach is not the same, as it advances `\G` whenever it would find `\d+` even if not followed by R. That may or may not be desireable in the more general PDF parsing code.	[reply] [d/l] [select]
Re^3: Why is Perl suddenly slow in THIS case? by vr (Curate) on Mar 06, 2017 at 11:00 UTC
It makes sense now, with your explanation, thank you. Intuit: trying to determine minimum start position... But isn't this "start position" already determined by `\G`? :) In my case it would be easy. Indirect objects can't happen in content streams, so modified (without first check) `parseAny_in_stream` can be added. It then is called as class method in `CAM::PDF::Content`. On the other hand, the 150 Mb "R"-less streams are probably too rare to justify any trouble making this fix.	[reply] [d/l] [select]
Re^2: Why is Perl suddenly slow in THIS case? by Corion (Patriarch) on Mar 06, 2017 at 11:47 UTC
As the original case is some parser like this: sub parseAny { #my $p = shift; # pkg or doc my $c = shift; #my $objnum = shift; #my $gennum = shift; return ${$c} =~ m/ \G \d+\s+\d+\s+R\b /xms ? 'parseRef( $c, $ +objnum, $gennum)' : ${$c} =~ m{ \G / }xms ? 'parseLabel( $c, $ +objnum, $gennum)' : ${$c} =~ m/ \G << /xms ? 'parseDict( $c, $ +objnum, $gennum)' : ${$c} =~ m/ \G \[ /xms ? 'parseArray( $c, $ +objnum, $gennum)' : ${$c} =~ m/ \G [(] /xms ? 'parseString( $c, $ +objnum, $gennum)' : ${$c} =~ m/ \G < /xms ? 'parseHexString($c, $ +objnum, $gennum)' : ${$c} =~ m/ \G [\d.+-]+ /xms ? 'parseNum( $c, $ +objnum, $gennum)' : ${$c} =~ m/ \G (true\|false) /ixms ? 'parseBoolean( $c, $ +objnum, $gennum)' : ${$c} =~ m/ \G null /ixms ? 'parseNull( $c, $ +objnum, $gennum)' : die "Unrecognized type in parseAny\n"; } [download] I think the best approach would be a tokenizer that takes the first character and decides from that what to do. This would mean rewriting the regex into something really unreadable like: sub parseAny_token { #my $p = shift; # pkg or doc my $c = shift; #my $objnum = shift; #my $gennum = shift; my $ch = m{ \G (?: ([0-9]+) \|(/) \|(<<) \|(\[) \|\([.+-]) \|(true\|false) \|(null) }xmsi } or die "Unrecognized type in parseAny\n"; # now dispatch based on $1 etc: if( defined $1 ) { my $num = $1; if( m/\G\s+\d+\s+R\b/ ) { # Handle $num $num $ parseRef( $c, $objnum, $gennum) } elsif( m/\G([-+.\d+])/ ) { # handle "$num$1" "$num$1 parseNum( $c, $objnum, $gennum) } else { # handle "$num" }; } elsif( defined $2 ) { # / parseLabel( $c, $objnum, $gennum) } ... } [download] That would need a lot of good unit tests to make sure the grammar rewrite still works and especially still picks up parsing at the right places when something like a `+R` comes in the input stream. In my toy implementation for the tokenizer, I get 90% of the performance of the original R case for both cases. Maybe it would be worth to share your problematic PDF with the author of CAM::PDF (or me) if you can, just to see whether it can be turned into a good test case...	[reply] [d/l] [select]
Re^3: Why is Perl suddenly slow in THIS case? by vr (Curate) on Mar 06, 2017 at 12:42 UTC
A particular file was chosen as extreme edge case. Huge PDFs are common, but their size is mostly because of images. Here it's "pure vector graphics", and I experimented with how soon `CAM::PDF` will fail because of "out of memory" when building "page content tree" (as opposed to "just open and parse objects and report this or that"). I.e. with this edge case I'm checking if it's practical to "build tree" considering speed and memory requirements, and if some improvements (still just ideas) can make it use less memory. And with "R"-less stream I couldn't check that because it would take hours and I didn't understand what's going on. I mean, to test modified parser, maybe ordinary distributed test files are OK.	[reply] [d/l]
Re: Why is Perl suddenly slow in THIS case? by BrowserUk (Patriarch) on Mar 05, 2017 at 19:46 UTC
Ignore this: I misread (inverted) the meaning of the results. < I get similar results to you using 5.22: `C:\test>\Perl22\bin\perl.exe 1183703.pl Rate Q R Q 1.29/s -- -96% R 34.9/s 2595% --` [download] And more consistent, but slower results using 5.10: `C:\test>1183703.pl Rate R Q R 46.0/s -- -1% Q 46.7/s 1% --` [download] My conclusion is that some recent optimisation helps enormously with your Q example, and somewhat, but less so with your R example. I hazard a guess that the optimisation is a fast linear search for the presence of any constant bits of the regex means the Q example is very fast -- the R constant isn't there. But for the R example, the constant 'R' is present, so the regex engine needs to be fired up, and it probably checks for the presence of one or more digits preceding every occurrence of 'R' in the string. Other than adding another optimisation to detect that the string doesn't contain any digits -- a difficult thing to generalise -- you should probably be rejoicing in the optimisation that speeds up the Q case, rather than complaining that it doesn't help in the R case. With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday' Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error. "Science is about questioning the status quo. Questioning authority". The enemy of (IT) success is complexity. In the absence of evidence, opinion is indistinguishable from prejudice.	[reply] [d/l] [select]
Re: Why is Perl suddenly slow in THIS case? by hdb (Monsignor) on Mar 06, 2017 at 11:15 UTC
On Strawberry, 5.16.3.1, 64bit, I do not get funny stuff: `Rate Q R Q 46.0/s -- -3% R 47.2/s 3% --` [download]	[reply] [d/l]
Re: Why is Perl suddenly slow in THIS case? by huck (Prior) on Mar 05, 2017 at 19:27 UTC
`\d+R` specifies 0 or more digits followed by an R, so when the string is 42,000 R's it matches, triggering next, and the loop continues. Since there are 42,000 R's next is signaled 42,000 times in the loop before it runs out of characters to test. When the string is 42,000 Q's that first test does not match, it falls thru, and since there is no whitespace the loop terminates ok, nevermind, i didnt get anypart right did i, \d+ is 1 or more, and \w is a word char not whitespace.	[reply] [d/l]