Re^2: Why is Perl suddenly slow in THIS case?

The original case is a parser roughly like this:

sub parseAny
{
   #my $p      = shift;  # pkg or doc
   my $c      = shift;
   #my $objnum = shift;
   #my $gennum = shift;
 
   return ${$c} =~ m/ \G \d+\s+\d+\s+R\b /xms  ? 'parseRef(      $c, $objnum, $gennum)'
        : ${$c} =~ m{ \G /               }xms  ? 'parseLabel(    $c, $objnum, $gennum)'
        : ${$c} =~ m/ \G <<              /xms  ? 'parseDict(     $c, $objnum, $gennum)'
        : ${$c} =~ m/ \G \[              /xms  ? 'parseArray(    $c, $objnum, $gennum)'
        : ${$c} =~ m/ \G [(]             /xms  ? 'parseString(   $c, $objnum, $gennum)'
        : ${$c} =~ m/ \G <               /xms  ? 'parseHexString($c, $objnum, $gennum)'
        : ${$c} =~ m/ \G [\d.+-]+        /xms  ? 'parseNum(      $c, $objnum, $gennum)'
        : ${$c} =~ m/ \G (true|false)    /ixms ? 'parseBoolean(  $c, $objnum, $gennum)'
        : ${$c} =~ m/ \G null            /ixms ? 'parseNull(     $c, $objnum, $gennum)'
        : die "Unrecognized type in parseAny\n";
}

I think the best approach would be a tokenizer that takes the first character and decides from that what to do. This would mean rewriting the regex into something really unreadable like:

sub parseAny_token
{
   #my $p      = shift;  # pkg or doc
   my $c      = shift;
   #my $objnum = shift;
   #my $gennum = shift;

   # /gc advances pos() past the token, so the follow-up matches below
   # start right after it. The string and hex-string alternatives of
   # parseAny are not part of this sketch.
   ${$c} =~ m{ \G (?:
               ([0-9]+)
              |(/)
              |(<<)
              |(\[)
              |([.+-])
              |(true|false)
              |(null) ) }xmsigc
      or die "Unrecognized type in parseAny\n";

   # now dispatch based on $1 etc:
   if( defined $1 ) {
       my $num = $1;
       if( ${$c} =~ m/\G\s+\d+\s+R\b/ ) {
           # handle "$num $num R"
           return 'parseRef(      $c, $objnum, $gennum)';
       } elsif( ${$c} =~ m/\G([-+.\d]+)/ ) {
           # handle "$num$1"
           return 'parseNum(      $c, $objnum, $gennum)';
       } else {
           # handle "$num"
           return 'parseNum(      $c, $objnum, $gennum)';
       };
   } elsif( defined $2 ) { # /
       return 'parseLabel(    $c, $objnum, $gennum)';
   }
   ...
}

That would need a lot of good unit tests to make sure the grammar rewrite still works and especially still picks up parsing at the right places when something like a +R comes in the input stream.
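
A table of inputs and expected token types is one way to pin the grammar down before rewriting it. The sketch below is only an illustration: classify is a hypothetical stand-in that mirrors the cascade in parseAny but returns a bare type name, and the inputs (including the offset case and the "+R" case) are made up for this example, not taken from the CAM::PDF test suite.

```perl
use strict;
use warnings;
use Test::More;

# Hypothetical stand-in for the cascade in parseAny: it only reports the
# name of the token type found at pos() of the referenced buffer.
sub classify {
    my ($c) = @_;
    return ${$c} =~ m/ \G \d+\s+\d+\s+R\b /xms  ? 'Ref'
         : ${$c} =~ m{ \G /               }xms  ? 'Label'
         : ${$c} =~ m/ \G <<              /xms  ? 'Dict'
         : ${$c} =~ m/ \G \[              /xms  ? 'Array'
         : ${$c} =~ m/ \G [(]             /xms  ? 'String'
         : ${$c} =~ m/ \G <               /xms  ? 'HexString'
         : ${$c} =~ m/ \G [\d.+-]+        /xms  ? 'Num'
         : ${$c} =~ m/ \G (?:true|false)  /ixms ? 'Boolean'
         : ${$c} =~ m/ \G null            /ixms ? 'Null'
         : die "Unrecognized type in classify\n";
}

# Each case: input buffer, offset to start matching at, expected type.
my @cases = (
    [ '12 0 R',     0, 'Ref'       ],
    [ '12 0 Rx',    0, 'Num'       ],   # no word boundary after R
    [ '12 0 obj',   0, 'Num'       ],
    [ '+R',         0, 'Num'       ],   # a sign followed by R is not a reference
    [ '-3.5',       0, 'Num'       ],
    [ '/Name',      0, 'Label'     ],
    [ '<< /A 1 >>', 0, 'Dict'      ],
    [ '<AB12>',     0, 'HexString' ],
    [ '[1 2]',      0, 'Array'     ],
    [ '(text)',     0, 'String'    ],
    [ 'true',       0, 'Boolean'   ],
    [ 'NULL',       0, 'Null'      ],
    [ '/A 12 0 R',  3, 'Ref'       ],   # resume in the middle of the buffer
);

for my $case (@cases) {
    my ($input, $offset, $expected) = @{$case};
    my $buffer = $input;        # fresh copy for every case
    pos($buffer) = $offset;     # \G anchors at this position
    is( classify(\$buffer), $expected,
        "'$input' at offset $offset is classified as $expected" );
}

done_testing();
```

The same table could then be run against the tokenizer variant to check that both versions agree.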

In my toy implementation of the tokenizer, I get 90% of the performance of the original R case for both cases. Maybe it would be worth sharing your problematic PDF with the author of CAM::PDF (or me) if you can, just to see whether it can be turned into a good test case...
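
For anyone who wants to repeat such a comparison, here is a rough sketch of how it could be set up with the core Benchmark module. Everything in it is an assumption made for illustration: classify_cascade and classify_token are hypothetical, cut-down stand-ins for the two approaches (only references and numbers), and the two buffers are synthetic, not the PDF from this thread. The numbers it prints will depend on the perl version and on the input.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Hypothetical cascade: try the reference pattern first, then numbers.
sub classify_cascade {
    my ($c) = @_;
    return ${$c} =~ m/ \G \d+\s+\d+\s+R\b /xms ? 'Ref'
         : ${$c} =~ m/ \G [\d.+-]+        /xms ? 'Num'
         : 'Other';
}

# Hypothetical tokenizer: consume the leading digits, then look at what follows.
sub classify_token {
    my ($c) = @_;
    my $start = pos( ${$c} ) // 0;
    my $type  = 'Other';
    if ( ${$c} =~ m/ \G [0-9]+ /xmsgc ) {
        $type = ${$c} =~ m/ \G \s+ \d+ \s+ R \b /xms ? 'Ref' : 'Num';
    }
    elsif ( ${$c} =~ m/ \G [.+-] /xms ) {
        $type = 'Num';
    }
    pos( ${$c} ) = $start;   # rewind so repeated calls see the same input
    return $type;
}

# Synthetic buffers: one starts with a reference, the other contains no "R" at all.
my $filler      = '0.5 0.25 m ' x 1000;
my $with_ref    = '12 0 R ' . $filler;
my $without_ref = '12 0 '   . $filler;

# Run each variant for about one CPU second and print a comparison table.
cmpthese( -1, {
    'cascade, with R' => sub { classify_cascade(\$with_ref)    },
    'cascade, no R'   => sub { classify_cascade(\$without_ref) },
    'token, with R'   => sub { classify_token(\$with_ref)      },
    'token, no R'     => sub { classify_token(\$without_ref)   },
} );
```

Before trusting any timing, it is worth checking that both variants return the same type for the same buffer.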

Re^3: Why is Perl suddenly slow in THIS case? by vr (Curate) on Mar 06, 2017 at 12:42 UTC
This particular file was chosen as an extreme edge case. Huge PDFs are common, but their size is mostly due to images. Here it's "pure vector graphics", and I experimented with how soon CAM::PDF fails with "out of memory" when building the "page content tree" (as opposed to "just open and parse objects and report this or that").

I.e. with this edge case I'm checking whether it's practical to "build the tree" considering speed and memory requirements, and whether some improvements (still just ideas) can make it use less memory. With the "R"-less stream I couldn't check that, because it would take hours and I didn't understand what was going on. I mean, to test a modified parser, maybe the ordinary test files shipped with the distribution are OK.
