Re: pdf with too long lines

Re^2: pdf with too long lines by Anonymous Monk on Aug 25, 2024 at 09:06 UTC
I believe the majority of humanity (i.e. amongst PC users) is both east of Greenwich and on Windows. For them, the very innocent `PDF::Data->new()` produces a head-scratching `'Use of uninitialized value in division (/)'` warning, because the author (a very nice guy, judging by the video) did `mktime(gmtime 0)`. Oh well, this is a harmless warning compared to the far graver problems.
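A minimal sketch of why that call can come back undefined, and one common way around it. I have not checked how `PDF::Data` actually uses the value, so this is only an illustration: `mktime(gmtime 0)` asks for the epoch time of local 1970-01-01 00:00:00, which lies before the epoch in any zone east of Greenwich, and some C runtimes (reportedly the Windows one) refuse that. The `timegm(localtime ...)` idiom below is my substitution, not the module's code, and it gives the offset at the current moment (so it includes DST) rather than at the epoch.

```perl
use strict;
use warnings;
use POSIX 'mktime';
use Time::Local 'timegm';

# The call described in the post. East of Greenwich this asks for a time
# before the epoch, so the result may be undef on some platforms.
my $naive = mktime( gmtime 0 );

# Alternative (an assumption, not PDF::Data's code): take the local
# wall-clock fields for "now" and reinterpret them as if they were UTC.
my $now    = time;
my $offset = timegm( localtime $now ) - $now;    # seconds east of UTC

printf "mktime(gmtime 0): %s\n", defined $naive ? $naive : 'undef';
printf "UTC offset now:   %+d seconds\n", $offset;
```

Whatever is done with the value afterwards, checking `defined $naive` before dividing would be enough to silence the warning.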
Re^3: pdf with too long lines by Anonymous Monk on Aug 25, 2024 at 11:31 UTC
And by "grave", I mean: was it even tested with anything beyond the "hello world" level, given that it is already past version "1" and has made its way to the TPRC, no less? Look, I don't want to be rude. I tried it with something slightly above kindergarten complexity, such as "HigherOrderPerl-trimmed.pdf" (and not "trimmed") or "modern_perl_ebook.pdf" (the latter converted to version 1.4 as explained in the video), and finally the very simple "What Every Computer Scientist Should Know About Floating-Point Arithmetic" PDF from the Oracle site. These are just off the top of my head: files anyone can google for a test, and surely familiar to the audience. This module failed to parse all of them. It boiled down to, e.g.:

    PDF::Data->new->parse_data( '[([)]' );
    Byte offset 0: Parse error on input: "[([)]"

Because a bracket is not expected inside a string that is an element of an array. I'm not saying all the files above failed because of the nested-array parsing regex. I suspect there are similar issues all over. Compare:

    use CAM::PDF;
    use PDF::API2;
    use Data::Dumper::Concise;
    print Dumper( CAM::PDF->parseAny( \'[([)]' ));
    print Dumper( PDF::API2::Basic::PDF::File->new->readval( "[([)]\n" ));

    bless( {
      type => "array",
      value => [
        bless( {
          type => "string",
          value => "[",
        }, 'CAM::PDF::Node' ),
      ],
    }, 'CAM::PDF::Node' )
    bless( {
      " realised" => 1,
      " val" => [
        bless( {
          " realised" => 1,
          val => "[",
        }, 'PDF::API2::Basic::PDF::String' ),
      ],
    }, 'PDF::API2::Basic::PDF::Array' )

(The fact that one of them expects a string reference, and the other needs a NL attached for this stand-alone example, is not relevant.)

In fact, judging by the code (though not the documentation), the constructor accepts a "-novalidate" parameter that turns off parsing of content streams (as opposed to the PDF structure), which is generally not required anyway.
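For what it's worth, the `[([)]` case is easy to handle once the literal string is consumed as a unit before the array looks for its closing bracket. Here is a rough sketch of that idea as a tiny recursive-descent reader; `parse_value` is a hypothetical helper of mine, not code from any of the three modules, and it covers only arrays, literal strings, numbers and names (escape sequences are kept as the bare character, which is a simplification).

```perl
use strict;
use warnings;

# Hypothetical helper, not taken from PDF::Data, CAM::PDF or PDF::API2.
# $src is a reference to the input; pos($$src) tracks the read position.
sub parse_value {
    my ($src) = @_;
    $$src =~ /\G\s+/gc;    # skip leading whitespace

    if ( $$src =~ /\G\[/gc ) {    # array: read values until ']'
        my @items;
        while (1) {
            $$src =~ /\G\s+/gc;
            last if $$src =~ /\G\]/gc;
            die "unterminated array\n"
                if ( pos($$src) // 0 ) >= length $$src;
            push @items, parse_value($src);
        }
        return { type => 'array', value => \@items };
    }

    if ( $$src =~ /\G\(/gc ) {    # literal string with balanced parentheses
        my ( $depth, $text ) = ( 1, '' );
        while ( $depth > 0 ) {
            if    ( $$src =~ /\G\\(.)/gcs )     { $text .= $1 }    # simplified escape
            elsif ( $$src =~ /\G\(/gc )         { $depth++; $text .= '(' }
            elsif ( $$src =~ /\G\)/gc )         { $text .= ')' if --$depth > 0 }
            elsif ( $$src =~ /\G([^\\()]+)/gc ) { $text .= $1 }    # brackets are plain text here
            else                                { die "unterminated string\n" }
        }
        return { type => 'string', value => $text };
    }

    return { type => 'number', value => $1 }
        if $$src =~ /\G([+-]?(?:\d+\.?\d*|\.\d+))/gc;
    return { type => 'name', value => $1 }
        if $$src =~ m{\G/([^\s\[\]()<>{}/%]*)}gc;

    die sprintf "parse error at offset %d\n", pos($$src) // 0;
}

my $input = '[([)]';
my $tree  = parse_value( \$input );
print "$tree->{type} with ", scalar @{ $tree->{value} },
    " element(s): $tree->{value}[0]{value}\n";
```

This prints `array with 1 element(s): [`, i.e. the same shape the two dumps above show.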
And then the most terrible déjà vu comes over me: I already tried this module several years ago, and already found back then that I had to turn validation off with this flag to even begin playing with it. Then, having discovered that the entry point (the startxref offset) is simply ignored (see the comment at line #185) and data are just gobbled from the beginning of the file, i.e. absolutely not according to the spec, I thought: wow. And I've not even begun comparing these parsers' performance, sadly.
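To make the startxref point concrete: a reader that follows the spec starts from the end of the file, finds the last `startxref` keyword, and jumps to the byte offset it names. A rough sketch of just that first step is below; `find_startxref` is a hypothetical helper, the 1024-byte tail window is my own choice, and the "PDF" is a toy string built on the spot rather than a real file.

```perl
use strict;
use warnings;

# Hypothetical helper: return the byte offset given by the last "startxref"
# near the end of the data, or undef if there is none.
sub find_startxref {
    my ($bytes) = @_;
    my $tail = length($bytes) > 1024 ? substr( $bytes, -1024 ) : $bytes;

    # Greedy .* makes this pick the last occurrence, which matters for
    # files with incremental updates (several startxref sections).
    my ($offset) = $tail =~ /.*startxref\s+(\d+)\s+%%EOF/s;
    return $offset;
}

# Toy document: a body, then an xref section whose offset we record.
my $body    = "%PDF-1.4\n1 0 obj\n<< /Type /Catalog >>\nendobj\n";
my $xref_at = length $body;
my $pdf     = $body
    . "xref\n0 1\n0000000000 65535 f \n"
    . "trailer\n<< /Size 2 /Root 1 0 R >>\n"
    . "startxref\n$xref_at\n%%EOF\n";

my $offset = find_startxref($pdf);
print "entry point at byte $offset: ", substr( $pdf, $offset, 4 ), "\n";
```

The four bytes at the reported offset are `xref`, which is where a conforming reader would begin, instead of scanning forward from byte 0.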
Re^4: pdf with too long lines by Anonymous Monk on Aug 27, 2024 at 01:52 UTC
Well, instead of getting annoyed, and then forgetting that you already found these limitations and getting annoyed again, you could ... report the bugs to the author?
Re^5: pdf with too long lines by Anonymous Monk on Aug 27, 2024 at 08:43 UTC
Re^4: pdf with too long lines by Anonymous Monk on Aug 26, 2024 at 09:28 UTC
"comparing these parsers performance"

Shall we?

    use strict;
    use warnings;
    use feature 'say';
    use Time::HiRes 'time';
    use PDF::API2;
    use CAM::PDF;
    use PDF::Data;

    # Run $code, print the elapsed wall-clock time with a note, return the result
    sub with_time ($&) {
        my ( $note, $code ) = @_;
        my $t = time;
        my $ret = &$code;
        printf "%-6.3f - %s\n", time - $t, $note;
        return $ret
    }

    my $fn = 'HigherOrderPerl-trimmed.pdf';

    {
        say ' PDF::API2';
        my $pdf = with_time '(1) open (and cache pages)',
            sub { PDF::API2->open( $fn ) };
        with_time '(2) parse/cache everything',
            sub { $pdf->{ pdf }->read_objnum( $_, 0 )
                for 1 .. $pdf->{ pdf }{' maxobj'} - 2 }
    }

    {
        say ' CAM::PDF';
        my $pdf = with_time '(1) open',
            sub { CAM::PDF->new( $fn ) };
        with_time '(2) parse/cache page objects',
            sub { $pdf->getPage( $_ ) for 1 .. $pdf->numPages };
        with_time '(3) parse/cache everything else',
            sub { $pdf->cacheObjects }
    }

    {
        say ' PDF::Data';
        with_time '(1) open and parse everything',
            sub { PDF::Data->read_pdf( $fn, '-novalidate' => 1 ) }
    }

    __END__
     PDF::API2
    0.649  - (1) open (and cache pages)
    1.067  - (2) parse/cache everything
     CAM::PDF
    0.010  - (1) open
    0.097  - (2) parse/cache page objects
    0.095  - (3) parse/cache everything else
     PDF::Data
    7.002  - (1) open and parse everything

The modules were created for different tasks and purposes, so this comparison is not really practical. It is just for entertainment, and additional proof to myself (though I don't need any) that one parser is superior to the alternatives, and why I'm using it and won't consider the others for my usual purposes, which are inspection/analysis and minor pinpoint changes. If/when I need PDF generation (especially from scratch), there's no choice but `PDF::API2`.

Just to clarify:

(a) `PDF::API2` caches all pages to its internal stack on open, therefore I added an extra step for `CAM::PDF`. And so, (1) + (2) for the latter is to be compared to (1) of the former.

(b) `PDF::Data` also decompresses/inflates all streams (even though we have disabled validation), which takes ~0.5 seconds.
That inflation could either be disabled (by patching the source) or turned on for the other participants, but I don't think it's important. The comparison is an illustration of how efficiently regexes are used for parsing.
Re^3: pdf with too long lines by perlfan (Parson) on Aug 27, 2024 at 23:15 UTC
WTH - I think the hackernews and reddit bots have found their way here.