And by "grave" I mean: was it even tested with anything beyond "hello world" level, given that it is already past version 1 and has made its way to the TPRC, no less? Look, I don't want to be rude. I tried it with something slightly above kindergarten complexity, such as "HigherOrderPerl-trimmed.pdf" (and the non-"trimmed" one) or "modern_perl_ebook.pdf" (the latter converted to version 1.4 as explained in the video), and at last with the very simple "What Every Computer Scientist Should Know About Floating-Point Arithmetic" PDF from the Oracle site. These are just off the top of my head: files anyone can google for a test, and surely familiar to the audience.
This module failed to parse every one of them.
The failure boils down to, e.g.:
use PDF::Data;
PDF::Data->new->parse_data( '[([)]' );
Byte offset 0: Parse error on input: "[([)]"
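This fails because the parser does not expect a bracket inside a string that is an element of an array, although '[([)]' is perfectly valid PDF: an array holding one literal string, "[". I'm not saying every file above failed because of the nested-array parsing regex, but I suspect there are similar issues all over.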
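To make the point concrete, here is a minimal sketch of how such input can be scanned character by character instead of with one nested regex. It is my own toy code, not taken from any of the modules discussed; the names read_literal_string and parse_string_array are hypothetical, it handles only arrays of literal strings, and it keeps escaped characters verbatim rather than decoding them.

```perl
use strict;
use warnings;

# Read one PDF literal string whose opening '(' is at index $pos.
# Returns the string content and the index just past the closing ')'.
# Only parentheses and backslashes are special inside a literal string;
# brackets are ordinary characters.
sub read_literal_string {
    my ($src, $pos) = @_;
    my $len   = length $src;
    my $depth = 0;
    my $out   = '';
    for (my $i = $pos; $i < $len; $i++) {
        my $c = substr $src, $i, 1;
        if ($c eq '\\') {
            # Simplification: keep the escaped character as-is;
            # a real parser would decode \n, \ddd and so on.
            $i++;
            $out .= substr($src, $i, 1) if $i < $len;
        }
        elsif ($c eq '(') {
            $out .= $c if $depth > 0;    # nested, balanced paren is content
            $depth++;
        }
        elsif ($c eq ')') {
            $depth--;
            return ($out, $i + 1) if $depth == 0;
            $out .= $c;
        }
        else {
            $out .= $c;
        }
    }
    die "Unterminated string starting at offset $pos\n";
}

# Parse a flat array of literal strings, e.g. '[(a) (b)]'.
sub parse_string_array {
    my ($src) = @_;
    my $len = length $src;
    my $i   = 0;
    $i++ while $i < $len && substr($src, $i, 1) =~ /\s/;
    die "Expected '[' at offset $i\n"
        unless $i < $len && substr($src, $i, 1) eq '[';
    $i++;
    my @items;
    while ($i < $len) {
        my $c = substr $src, $i, 1;
        if    ($c =~ /\s/) { $i++ }
        elsif ($c eq ']')  { return \@items }
        elsif ($c eq '(')  {
            my $str;
            ($str, $i) = read_literal_string($src, $i);
            push @items, $str;
        }
        else { die "Unexpected '$c' at offset $i\n" }
    }
    die "Unterminated array\n";
}

print join(' | ', @{ parse_string_array('[([)]') }), "\n";    # prints "["
```

Once the scanner is inside a string, a "[" or "]" is consumed like any other character, so the array delimiter can never be confused with string content.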
Compare:
use CAM::PDF;
use PDF::API2;
use Data::Dumper::Concise;
print Dumper( CAM::PDF->parseAny( \'[([)]' ));
print Dumper( PDF::API2::Basic::PDF::File->new->readval( "[([)]\n" ));
bless( {
  type => "array",
  value => [
    bless( {
      type => "string",
      value => "[",
    }, 'CAM::PDF::Node' ),
  ],
}, 'CAM::PDF::Node' )
bless( {
  " realised" => 1,
  " val" => [
    bless( {
      " realised" => 1,
      val => "[",
    }, 'PDF::API2::Basic::PDF::String' ),
  ],
}, 'PDF::API2::Basic::PDF::Array' )
(The fact that one of them expects a string reference and the other needs a trailing newline for this stand-alone example is not relevant.)
In fact, reading the code (but not the documentation), I see the constructor accepts a "-novalidate" parameter, which turns off parsing of content streams (as opposed to PDF structure), something that is generally not required anyway. And then the most terrible deja vu came over me: I had already tried this module several years ago, and had already found that I had to switch that parsing off with this flag to even begin playing with it. Then, having discovered that the entry point (the startxref offset) is simply ignored (see the comment at line #185) and the data are just gobbled from the beginning of the file, i.e. absolutely not according to the spec, I thought: wow.
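For reference, what the spec expects is roughly this: go to the end of the file, find the last "startxref" keyword, and start from the byte offset written after it. A minimal sketch, assuming the whole file is already in a scalar; find_startxref is a hypothetical name of mine, and the 1024-byte tail window is a common heuristic, not something taken from the module.

```perl
use strict;
use warnings;

# Return the byte offset that follows the last "startxref" keyword.
# Assumption: the PDF is held in memory as a byte string.
sub find_startxref {
    my ($pdf) = @_;
    # Only the tail of the file is searched (1024 bytes is a heuristic).
    my $tail_len = length($pdf) < 1024 ? length($pdf) : 1024;
    my $tail     = substr $pdf, -$tail_len;
    my $offset;
    # Keep the last match: incrementally updated files have several.
    while ($tail =~ /startxref\s+(\d+)\s+%%EOF/g) {
        $offset = $1;
    }
    die "No startxref found\n" unless defined $offset;
    return $offset;
}

# Toy file, not a complete PDF: just enough to exercise the lookup.
my $body     = "%PDF-1.4\n";
my $xref_pos = length $body;
my $pdf      = $body
    . "xref\n0 1\n0000000000 65535 f \n"
    . "trailer\n<< /Size 1 >>\n"
    . "startxref\n$xref_pos\n%%EOF\n";

my $offset = find_startxref($pdf);
print "xref section starts at byte $offset\n";
```

A reader that starts here and follows the cross-reference data only touches the objects the file actually declares, instead of whatever happens to sit between the header and the trailer.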
And I've not even begun comparing these parsers' performance, sadly.