in reply to Re^2: pdf with too long lines
in thread pdf with too long lines
And by "grave", I mean: was it even tested with anything beyond "hello world" level, given that it is already past version "1" and has made its way to the TPRC, no less? Look, I don't want to be rude. I tried it with things slightly above kindergarten complexity, such as "HigherOrderPerl-trimmed.pdf" (and not "trimmed") or "modern_perl_ebook.pdf" (the latter converted to PDF version 1.4, as explained in the video), and finally with the very simple "What Every Computer Scientist Should Know About Floating-Point Arithmetic" PDF from the Oracle site. These are just off the top of my head: files anyone can google for a test, and surely familiar to the audience.
This module failed parsing all of them.
Boiled down to, e.g.:
PDF::Data->new->parse_data( '[([)]' );

Byte offset 0: Parse error on input: "[([)]"
Because a bracket is not expected inside a string that is an element of an array. I'm not saying all the files above failed because of the regex that parses nested arrays, but I suspect there are similar issues all over.
Compare:
use CAM::PDF;
use PDF::API2;
use Data::Dumper::Concise;

print Dumper( CAM::PDF->parseAny( \'[([)]' ));
print Dumper( PDF::API2::Basic::PDF::File->new->readval( "[([)]\n" ));

bless( {
  type => "array",
  value => [
    bless( {
      type => "string",
      value => "[",
    }, 'CAM::PDF::Node' ),
  ],
}, 'CAM::PDF::Node' )
bless( {
  " realised" => 1,
  " val" => [
    bless( {
      " realised" => 1,
      val => "[",
    }, 'PDF::API2::Basic::PDF::String' ),
  ],
}, 'PDF::API2::Basic::PDF::Array' )
(The fact that one of them expects a string reference, while the other needs a newline appended for this stand-alone example, is not relevant.)
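To spell out the rule both of those parsers follow: inside a PDF literal string only parentheses and the backslash are special, so a "[" or "]" between "(" and ")" is ordinary text. Below is a minimal sketch in core Perl, not code from any of the modules above. The name parse_value is hypothetical, it handles only arrays and literal strings, and escape sequences are simplified to "keep the next character as is".

```perl
use strict;
use warnings;

# Hypothetical helper (not from PDF::Data, CAM::PDF or PDF::API2).
# Takes a reference to the input so that pos() survives the recursion.
sub parse_value {
    my ($src) = @_;
    $$src =~ /\G\s+/gc;    # skip leading whitespace

    if ( $$src =~ /\G\[/gc ) {    # array: read values until the closing ']'
        my @items;
        while (1) {
            $$src =~ /\G\s+/gc;
            last if $$src =~ /\G\]/gc;
            die "unterminated array\n"
                if ( pos($$src) // 0 ) >= length($$src);
            push @items, parse_value($src);
        }
        return { type => 'array', value => \@items };
    }

    if ( $$src =~ /\G\(/gc ) {    # literal string: track parenthesis depth
        my ( $depth, $text ) = ( 1, '' );
        while ( $depth > 0 ) {
            if ( $$src =~ /\G\\(.)/gcs ) {
                $text .= $1;      # simplified: escaped character kept literally
            }
            elsif ( $$src =~ /\G\(/gc ) {
                $depth++;
                $text .= '(';     # balanced inner parens belong to the string
            }
            elsif ( $$src =~ /\G\)/gc ) {
                $text .= ')' if --$depth > 0;
            }
            elsif ( $$src =~ /\G([^\\()]+)/gc ) {
                $text .= $1;      # everything else, brackets included, is text
            }
            else {
                die "unterminated string\n";
            }
        }
        return { type => 'string', value => $text };
    }

    die "unsupported input at offset " . ( pos($$src) // 0 ) . "\n";
}

my $input = '[([)]';
my $tree  = parse_value( \$input );
printf "%s with %d element(s); first is a %s containing \"%s\"\n",
    $tree->{type}, scalar @{ $tree->{value} },
    $tree->{value}[0]{type}, $tree->{value}[0]{value};
```

The point of the sketch is only that the array parser has to hand off to a string parser when it sees "(" and must not scan ahead for brackets itself. A real parser also needs names, numbers, dictionaries, hex strings and the full set of escapes.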
In fact, reading the code (but not the documentation), I see the constructor can take a "-novalidate" parameter, which turns off parsing of content streams (as opposed to the PDF structure), something that is generally not required anyway. And then the most terrible deja vu comes over me: I already tried this module several years ago, and already found back then that I had to turn this flag off to even begin playing with it. Then, having discovered that the entry point (the startxref offset) is simply ignored (see the comment at line #185) and the data are just gobbled from the beginning of the file, i.e. absolutely not according to the spec, I thought: wow.
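For contrast, this is roughly what "according to the spec" means for the entry point: a reader starts at the end of the file, finds the last "startxref" keyword before "%%EOF", and jumps to the byte offset written after it. Again a minimal sketch in core Perl. The name find_startxref and the 1024-byte tail window are my assumptions for illustration, not something taken from any module discussed here.

```perl
use strict;
use warnings;

# Hypothetical helper: $pdf holds the whole file as a byte string.
# Returns the offset recorded after the last "startxref", or undef.
sub find_startxref {
    my ($pdf) = @_;
    my $tail_len = 1024;    # assumption: the trailer fits in this window
    my $tail = length($pdf) > $tail_len ? substr( $pdf, -$tail_len ) : $pdf;

    # Keep the last match: incremental updates append newer trailers.
    my $offset;
    while ( $tail =~ /startxref\s+(\d+)\s+%%EOF/g ) {
        $offset = $1;
    }
    return $offset;
}

# A tiny fake file, just enough structure to exercise the lookup.
my $body    = "%PDF-1.4\n1 0 obj\n<< /Type /Catalog >>\nendobj\n";
my $xref_at = length $body;
my $pdf     = $body
    . "xref\n0 1\n0000000000 65535 f \n"
    . "trailer\n<< /Size 1 >>\n"
    . "startxref\n$xref_at\n%%EOF\n";

my $offset = find_startxref($pdf);
print "cross-reference section at byte $offset: ",
    substr( $pdf, $offset, 4 ), "\n";
```

From that offset a reader parses the cross-reference table (or cross-reference stream) and only then fetches objects by their recorded positions, rather than scanning from byte 0 and hoping that every object found on the way is still live.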
And I've not even begun comparing these parsers' performance, sadly.