in reply to Re: pdf with too long lines
in thread pdf with too long lines

I believe the majority of humanity (i.e. amongst PC users) is both east of Greenwich and on Windows. For them the very innocent PDF::Data->new() produces a head-scratching 'Use of uninitialized value in division (/)' warning, because the author (very nice guy, judging by the video) did mktime(gmtime 0). Oh well, this is so harmless a warning compared to far more grave problems.
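For anyone wondering where the warning comes from: gmtime 0 returns the epoch as broken-down UTC fields, but mktime reads its arguments as local time. East of Greenwich, local midnight on 1 Jan 1970 falls before the epoch, so the result is negative, or undef on a C runtime that refuses pre-epoch times (which, as far as I know, is the Windows case). A minimal sketch; the helper name is made up for illustration and nothing here is taken from the PDF::Data source:

```perl
use strict;
use warnings;
use POSIX qw(mktime);

# Hypothetical helper, not part of PDF::Data.
# gmtime(0) gives the fields of 1970-01-01 00:00:00; mktime() treats them
# as *local* time and returns undef when the C library reports failure.
sub describe_epoch_mktime {
    my $t = mktime( gmtime 0 );
    return 'undef (mktime failed)' unless defined $t;
    return sprintf '%d seconds', $t;
}

print 'mktime(gmtime 0) = ', describe_epoch_mktime(), "\n";
```

Any arithmetic on an unchecked undef result then triggers the 'uninitialized value' warning, so a defined() check before the division would be enough to silence it.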

Re^3: pdf with too long lines
by Anonymous Monk on Aug 25, 2024 at 11:31 UTC

    And by "grave", I mean: was it even tested with anything above "hello world" level, being at a version above "1" already and having made its way to the TPRC, no less? Look, I don't want to be rude, but I tried it with something slightly above kindergarten complexity, such as "HigherOrderPerl-trimmed.pdf" (and not "trimmed") or "modern_perl_ebook.pdf" (the latter converted to version 1.4 as explained in the video), then at last the very simple "What Every Computer Scientist Should Know About Floating-Point Arithmetic" PDF from the Oracle site -- just off the top of my head -- I'm naming files anyone can google for a test and that are surely familiar to the audience.

    This module failed parsing all of them.

    Boiled down to, e.g.:

    PDF::Data->new->parse_data( '[([)]' );

    Byte offset 0: Parse error on input: "[([)]"

    Because a bracket is not expected inside a string that is an element of an array. I'm not saying all the files above failed because of the nested-array parsing regex. I suspect there are similar issues all over.

    Compare:

    use CAM::PDF;
    use PDF::API2;
    use Data::Dumper::Concise;

    print Dumper( CAM::PDF->parseAny( \'[([)]' ));
    print Dumper( PDF::API2::Basic::PDF::File->new->readval( "[([)]\n" ));

    bless( {
      type => "array",
      value => [
        bless( {
          type => "string",
          value => "[",
        }, 'CAM::PDF::Node' ),
      ],
    }, 'CAM::PDF::Node' )
    bless( {
      " realised" => 1,
      " val" => [
        bless( {
          " realised" => 1,
          val => "[",
        }, 'PDF::API2::Basic::PDF::String' ),
      ],
    }, 'PDF::API2::Basic::PDF::Array' )

    (the fact that one of them expects a string reference, and the other needs a trailing NL for this stand-alone example, is not relevant)
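For what it's worth, a tokenizer gets this case right by consuming a literal string as one unit, with balanced parentheses and backslash escapes, before it ever looks for the array's closing bracket. A rough sketch of that idea follows. It is not the code of any of the three modules, the names parse_value and $LITERAL are made up, and it only knows arrays, literal strings and bare tokens (escapes are kept raw, not decoded):

```perl
use strict;
use warnings;

# Literal string: balanced parentheses; a backslash escapes the next byte.
my $LITERAL = qr/ (?<lit> \( (?: [^()\\]++ | \\. | (?&lit) )*+ \) ) /xs;

# Hypothetical recursive-descent parser working on a reference to the input.
sub parse_value {
    my ($src) = @_;
    $$src =~ /\G\s+/gc;                          # skip leading whitespace
    if ( $$src =~ /\G\[/gc ) {                   # array: values until ']'
        my @items;
        until ( $$src =~ /\G\s*\]/gc ) {
            die "unterminated array\n" if $$src =~ /\G\s*\z/gc;
            push @items, parse_value($src);
        }
        return { type => 'array', value => \@items };
    }
    if ( $$src =~ /\G$LITERAL/gc ) {             # string taken whole, brackets and all
        return { type => 'string', value => substr( $+{lit}, 1, -1 ) };
    }
    if ( $$src =~ /\G([^\s\[\]()]+)/gc ) {       # anything else: a bare token
        return { type => 'token', value => $1 };
    }
    die 'parse error at offset ' . ( pos($$src) // 0 ) . "\n";
}

my $input = '[([)]';
my $tree  = parse_value( \$input );
print "$tree->{type} holding a $tree->{value}[0]{type}: '$tree->{value}[0]{value}'\n";
```

The point is only the ordering: once the string rule owns everything between its parentheses, a bracket inside it can never be mistaken for array syntax.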

    In fact, reading the code (but not the documentation), the constructor accepts a "-novalidate" parameter, which turns off parsing of content streams (as opposed to PDF structure), which is generally not required anyway. And then the most terrible deja-vu comes over me -- I already tried this module several years ago, and already found that I had to set this flag (i.e. turn validation off) to even begin playing with it. Then, having discovered that the entry point (startxref offset) is simply ignored (line #185 comment) and data are just gobbled from the beginning of the file, i.e. absolutely not according to the spec, I thought wow.
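To spell out what "according to the spec" means here: a reader is supposed to start at the end of the file, find the startxref keyword, and take the number after it as the byte offset of the cross-reference section. A small sketch under stated assumptions: the sub name is made up, the 1024-byte tail window is a common reader convention rather than anything stated in this thread, and the sample data is a toy, not a valid PDF:

```perl
use strict;
use warnings;

# Hypothetical helper: return the offset announced by the last "startxref".
sub find_startxref {
    my ($pdf) = @_;                              # whole file as a byte string
    my $tail = length($pdf) > 1024 ? substr( $pdf, -1024 ) : $pdf;
    # Greedy .* picks the last occurrence; updated files may hold several.
    my ($offset) = $tail =~ /.*startxref\s+(\d+)\s+%%EOF/s
        or return;
    return $offset;
}

# Toy data: "xref" starts at byte 30 (9 + 8 + 6 + 7 bytes precede it).
my $pdf = "%PDF-1.4\n1 0 obj\n<< >>\nendobj\n"
        . "xref\n0 1\n0000000000 65535 f \n"
        . "trailer\n<< /Size 1 >>\nstartxref\n30\n%%EOF\n";

my $offset = find_startxref($pdf);
print "xref section at byte $offset: ", substr( $pdf, $offset, 4 ), "\n";
```

A parser that reads from byte 0 instead will also pick up objects that a later incremental update has replaced or freed, which is why the entry point matters.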

    And I've not even begun comparing these parsers' performance, sadly.

      Well, instead of getting annoyed, and then forgetting that you already found these limitations and getting annoyed again, you could ... report the bugs to the author?

        A conscious adult decision to ignore the specification because it suits them better is not a bug. Nothing to "report". Nor do I want to improve their regexes. Nor am I walking around town (or the web) pointing a finger at whoever I think does wrong. It was someone else who came and said "Hurray! Brand new shiny module promoted at TPRC, instead of rusty ugly old ones." "Annoyed"? Not me. Unwelcome nodes can always be downvoted or deleted.

      comparing these parsers' performance

      Shall we?

      use strict;
      use warnings;
      use feature 'say';
      use Time::HiRes 'time';
      use PDF::API2;
      use CAM::PDF;
      use PDF::Data;

      sub with_time ($&) {
          my ( $note, $code ) = @_;
          my $t = time;
          my $ret = &$code;
          printf "%-6.3f - %s\n", time - $t, $note;
          return $ret
      }

      my $fn = 'HigherOrderPerl-trimmed.pdf';

      {
          say ' PDF::API2';
          my $pdf = with_time '(1) open (and cache pages)',
              sub { PDF::API2->open( $fn ) };
          with_time '(2) parse/cache everything',
              sub { $pdf->{ pdf }->read_objnum( $_, 0 )
                  for 1 .. $pdf->{ pdf }{' maxobj'} - 2 }
      }
      {
          say ' CAM::PDF';
          my $pdf = with_time '(1) open',
              sub { CAM::PDF->new( $fn ) } ;
          with_time '(2) parse/cache page objects',
              sub { $pdf->getPage( $_ ) for 1 .. $pdf->numPages };
          with_time '(3) parse/cache everything else',
              sub { $pdf->cacheObjects }
      }
      {
          say ' PDF::Data';
          with_time '(1) open and parse everything',
              sub { PDF::Data->read_pdf( $fn, '-novalidate' => 1 ) }
      }

      __END__

       PDF::API2
      0.649 - (1) open (and cache pages)
      1.067 - (2) parse/cache everything
       CAM::PDF
      0.010 - (1) open
      0.097 - (2) parse/cache page objects
      0.095 - (3) parse/cache everything else
       PDF::Data
      7.002 - (1) open and parse everything

      Modules were created for different tasks/purposes; this comparison is not really practical, just for entertainment and additional proof, to self (though I don't need any), that one parser is superior to alternatives and why I'm using one and won't consider others for my usual purposes, which are inspection/analysis and minor pinpoint changes. If/when I need PDF generation (especially from scratch), there's no choice but PDF::API2.

      Just to clarify: (a) PDF::API2 caches all pages to its internal stack on open, therefore I added an extra step for CAM::PDF, so (1) + (2) for the latter is to be compared to (1) of the former. (b) PDF::Data also decompresses/inflates all streams (though we have prohibited validation), which takes ~0.5 seconds. That could be either disabled (by patching the source) or turned on for the other participants, but I don't think it's important. The comparison is an illustration of how efficiently each parser uses regexes.

Re^3: pdf with too long lines
by perlfan (Parson) on Aug 27, 2024 at 23:15 UTC
    WTH - I think the hackernews and reddit bots have found their way here.