Re^3: pdf with too long lines

And by "grave", I mean: was it even tested with anything beyond the "hello world" level, given that it is already past version 1 and has made its way to the TPRC, no less? Look, I don't want to be rude. I tried it with something slightly above kindergarten complexity, such as "HigherOrderPerl-trimmed.pdf" (and not "trimmed") or "modern_perl_ebook.pdf" (the latter converted to version 1.4 as explained in the video), and at last with the very simple "What Every Computer Scientist Should Know About Floating-Point Arithmetic" PDF from the Oracle site. These are just off the top of my head: files anyone can google for a test, and surely familiar to the audience.

This module failed parsing all of them.

The failure boils down to, e.g.:

PDF::Data->new->parse_data( '[([)]' );

Byte offset 0: Parse error on input: "[([)]"

Because a bracket is not expected inside a string that is an element of an array. I'm not saying all the files above failed because of the nested-array parsing regex; I suspect there are similar issues all over.

Compare:

use CAM::PDF;
use PDF::API2;
use Data::Dumper::Concise;
print Dumper( CAM::PDF->parseAny( \'[([)]' ));
print Dumper( PDF::API2::Basic::PDF::File->new->readval( "[([)]\n" ));

bless( {
  type => "array",
  value => [
    bless( {
      type => "string",
      value => "[",
    }, 'CAM::PDF::Node' ),
  ],
}, 'CAM::PDF::Node' )
bless( {
  " realised" => 1,
  " val" => [
    bless( {
      " realised" => 1,
      val => "[",
    }, 'PDF::API2::Basic::PDF::String' ),
  ],
}, 'PDF::API2::Basic::PDF::Array' )

(the fact that one of them expects a string reference, and the other needs a NL appended for this stand-alone example, is not relevant)
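
For illustration, here is a minimal sketch of one way to handle this input: the literal string is consumed as a unit, counting balanced parentheses, before any array delimiter is looked at. The names `parse_literal_string` and `parse_array` are hypothetical and not taken from any of the modules above. The sketch only handles arrays and literal strings, and escape sequences are not decoded.

```perl
use strict;
use warnings;

# Hypothetical helper: parse one literal string starting at pos($$ref),
# which must point at the opening "(". Balanced, unescaped parentheses may
# appear inside the string; a backslash protects the next character.
sub parse_literal_string {
    my ($ref) = @_;
    $$ref =~ /\G\(/gc or die "expected '('";
    my ( $depth, $out ) = ( 1, '' );
    while ( $depth > 0 ) {
        if    ( $$ref =~ /\G\\(.)/gcs )     { $out .= $1 }              # simplified: escapes not decoded
        elsif ( $$ref =~ /\G\(/gc )         { $depth++; $out .= '(' }
        elsif ( $$ref =~ /\G\)/gc )         { $out .= ')' if --$depth } # outermost ")" is not content
        elsif ( $$ref =~ /\G([^()\\]+)/gc ) { $out .= $1 }              # "[" and "]" are plain content here
        else                                { die "unterminated string" }
    }
    return $out;
}

# Hypothetical helper: parse an array whose elements are literal strings
# or nested arrays. Returns an array reference.
sub parse_array {
    my ($ref) = @_;
    $$ref =~ /\G\s*\[/gc or die "expected '['";
    my @items;
    while (1) {
        $$ref =~ /\G\s+/gc;                          # skip whitespace between elements
        my $next = substr( $$ref, pos($$ref), 1 );   # peek at the next character
        if    ( $next eq '(' )      { push @items, parse_literal_string($ref) }
        elsif ( $next eq '[' )      { push @items, parse_array($ref) }
        elsif ( $$ref =~ /\G\]/gc ) { return \@items }
        else                        { die "unsupported token at offset " . pos($$ref) }
    }
}

my $src  = '[([)]';
my $tree = parse_array( \$src );
print scalar(@$tree), " element(s), first is '$tree->[0]'\n";
# prints: 1 element(s), first is '['
```

The point of the design is only that the string branch never sees "[" or "]" as structure, so the array branch cannot be confused by them.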

In fact, reading the code (but not the documentation), there is a constructor parameter "-novalidate", which turns off parsing of content streams (as opposed to PDF structure), which is generally not required anyway. And then the most terrible deja vu occurs to me: I have already tried this module several years ago, and already found that I had to turn this validation off to even begin playing with it. Then, having discovered that the entry point (startxref offset) is simply ignored (line #185 comment) and data are just gobbled from the beginning of the file, i.e. absolutely not according to the spec, I thought wow.
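
To make the point about the entry point concrete, here is a minimal sketch of where a reader is supposed to start: at the end of the file, picking up the offset that follows the last `startxref` keyword. The helper name `find_startxref` and the 1024-byte tail window are my assumptions for illustration. The sketch takes the file contents as a string and does not go on to parse the cross-reference data itself.

```perl
use strict;
use warnings;

# Hypothetical helper: return the byte offset recorded after the last
# "startxref" keyword near the end of the data, or undef if there is none.
sub find_startxref {
    my ($data) = @_;

    # Assumption: only the last 1024 bytes are inspected.
    my $tail = length($data) > 1024 ? substr( $data, -1024 ) : $data;

    # Keep the last match, so an incrementally updated file yields the
    # offset written by its most recent update.
    my $offset;
    $offset = $1 while $tail =~ /startxref\s+(\d+)\s+%%EOF/g;
    return $offset;
}

# Dummy data for illustration only; the offsets do not point at real xref data.
my $data = "%PDF-1.4\n...\nstartxref\n12\n%%EOF\n...\nstartxref\n345\n%%EOF\n";
print find_startxref($data), "\n";
# prints: 345
```

A real reader would then seek to that offset and read the cross-reference information found there, instead of scanning objects from byte 0.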

And I've not even begun comparing these parsers' performance, sadly.

Re^4: pdf with too long lines by Anonymous Monk on Aug 27, 2024 at 01:52 UTC
Well, instead of getting annoyed, and then forgetting that you already found these limitations and getting annoyed again, you could ... report the bugs to the author?
Re^5: pdf with too long lines by Anonymous Monk on Aug 27, 2024 at 08:43 UTC
A conscious adult decision to ignore the specification because it suits them better is not a bug. Nothing to "report". Nor do I want to improve their regexes. Nor am I walking around town (or the web) pointing a finger at whoever I think does wrong. It was someone else who came and said "Hurray! Brand new shiny module promoted at TPRC, instead of rusty ugly old ones." "Annoyed"? Not me. Unwelcome nodes can always be downvoted or deleted.
Re^4: pdf with too long lines by Anonymous Monk on Aug 26, 2024 at 09:28 UTC
comparing these parsers performance

Shall we?

use strict;
use warnings;
use feature 'say';
use Time::HiRes 'time';
use PDF::API2;
use CAM::PDF;
use PDF::Data;

sub with_time ($&) {
    my ( $note, $code ) = @_;
    my $t = time;
    my $ret = &$code;
    printf "%-6.3f - %s\n", time - $t, $note;
    return $ret
}

my $fn = 'HigherOrderPerl-trimmed.pdf';

{
    say ' PDF::API2';
    my $pdf = with_time '(1) open (and cache pages)', sub {
        PDF::API2->open( $fn )
    };
    with_time '(2) parse/cache everything', sub {
        $pdf->{ pdf }->read_objnum( $_, 0 )
            for 1 .. $pdf->{ pdf }{' maxobj'} - 2
    }
}

{
    say ' CAM::PDF';
    my $pdf = with_time '(1) open', sub {
        CAM::PDF->new( $fn )
    };
    with_time '(2) parse/cache page objects', sub {
        $pdf->getPage( $_ ) for 1 .. $pdf->numPages
    };
    with_time '(3) parse/cache everything else', sub {
        $pdf->cacheObjects
    }
}

{
    say ' PDF::Data';
    with_time '(1) open and parse everything', sub {
        PDF::Data->read_pdf( $fn, '-novalidate' => 1 )
    }
}

__END__

 PDF::API2
0.649 - (1) open (and cache pages)
1.067 - (2) parse/cache everything
 CAM::PDF
0.010 - (1) open
0.097 - (2) parse/cache page objects
0.095 - (3) parse/cache everything else
 PDF::Data
7.002 - (1) open and parse everything

Modules were created for different tasks/purposes; this comparison is not really practical, just for entertainment and additional proof, to self (though I don't need any), that one parser is superior to alternatives and why I'm using one and won't consider others for my usual purposes, which are inspection/analysis and minor pinpoint changes. If/when I need PDF generation (especially from scratch), there's no choice but `PDF::API2`.

Just to clarify,

(a) `PDF::API2` caches all pages to its internal stack on open, therefore I did additional step for `CAM::PDF`. And so, (1) + (2) for the latter is to be compared to (1) of the former.

(b) `PDF::Data` also decompresses/inflates all streams (though we have prohibited validation), which takes ~0.5 seconds. It could be either disabled (patching source) or turned on for other participants, but I don't think it's important.

The comparison is illustration of regex use efficiency to parse.