And by "grave", I mean: was it even tested with anything beyond "hello world" level, given that it is already past version 1 and has made its way to the TPRC, no less? Look, I don't want to be rude. I tried it with something slightly above kindergarten complexity, such as "HigherOrderPerl-trimmed.pdf" (and not "trimmed") or "modern_perl_ebook.pdf" (the latter converted to version 1.4 as explained in the video), and finally the very simple "What Every Computer Scientist Should Know About Floating-Point Arithmetic" PDF from the Oracle site -- just off the top of my head. I'm naming files that anyone can google for a test and that are surely familiar to the audience.

This module failed parsing all of them.

It boils down to, e.g.:

PDF::Data->new->parse_data( '[([)]' );

Byte offset 0: Parse error on input: "[([)]"

Because a bracket is not expected inside a string that is an element of an array. I'm not saying all the files above failed because of the nested-array parsing regex, but I suspect there are similar issues all over.
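For context: in PDF, a literal string is delimited by parentheses, and square brackets have no special meaning inside it, so `[([)]` is a one-element array holding the string `[`. Below is a minimal sketch of a recursive-descent reader that treats it that way. It is not code from PDF::Data, CAM::PDF or PDF::API2; `parse_pdf_value` is a hypothetical helper covering only arrays, literal strings and bare tokens, and it keeps escaped characters literally instead of decoding them.

```perl
use strict;
use warnings;

# Hypothetical helper (not from any of the modules discussed here).
# Takes a reference to the input and advances pos() on it via \G matching.
sub parse_pdf_value {
    my ($src) = @_;
    $$src =~ /\G\s+/gc;                        # skip leading whitespace

    if ( $$src =~ /\G\[/gc ) {                 # array: read values until ']'
        my @items;
        while (1) {
            $$src =~ /\G\s+/gc;
            last if $$src =~ /\G\]/gc;
            push @items, parse_pdf_value($src);   # dies on end of input
        }
        return { type => 'array', value => \@items };
    }

    if ( $$src =~ /\G\(/gc ) {                 # literal string: only ( ) \ are special
        my ( $depth, $str ) = ( 1, '' );
        while ( $depth > 0 ) {
            if    ( $$src =~ /\G\\(.)/gcs )     { $str .= $1 }    # simplification: no escape decoding
            elsif ( $$src =~ /\G\(/gc )         { $depth++; $str .= '(' }
            elsif ( $$src =~ /\G\)/gc )         { $depth--; $str .= ')' if $depth }
            elsif ( $$src =~ /\G([^\\()]+)/gc ) { $str .= $1 }    # brackets land here as plain text
            else                                { die "unterminated string\n" }
        }
        return { type => 'string', value => $str };
    }

    if ( $$src =~ /\G([^\s\[\]()]+)/gc ) {     # anything else: a bare token
        return { type => 'token', value => $1 };
    }

    die "parse error at byte offset " . ( pos($$src) // 0 ) . "\n";
}

my $input = '[([)]';
my $tree  = parse_pdf_value( \$input );
print $tree->{value}[0]{value}, "\n";          # prints "["
```

The point of the sketch is only that string scanning has to happen in its own state, separate from the array delimiters, which is hard to get right with a single regex over the whole array.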

Compare:

use CAM::PDF;
use PDF::API2;
use Data::Dumper::Concise;
print Dumper( CAM::PDF->parseAny( \'[([)]' ));
print Dumper( PDF::API2::Basic::PDF::File->new->readval( "[([)]\n" ));

bless( {
  type => "array",
  value => [
    bless( {
      type => "string",
      value => "[",
    }, 'CAM::PDF::Node' ),
  ],
}, 'CAM::PDF::Node' )
bless( {
  " realised" => 1,
  " val" => [
    bless( {
      " realised" => 1,
      val => "[",
    }, 'PDF::API2::Basic::PDF::String' ),
  ],
}, 'PDF::API2::Basic::PDF::Array' )

(the fact that one of them expects a string reference, and the other needs a newline appended for this stand-alone example, is not relevant)

In fact, reading the code (but not the documentation), there is a constructor parameter "-novalidate", which turns off parsing of content streams (as opposed to PDF structure), which is generally not required anyway. And then the most terrible deja vu came over me -- I had already tried this module several years ago, and had already found that I had to switch that parsing off to even begin playing with it. Then, having discovered that the entry point (startxref offset) is simply ignored (line #185 comment) and data are just gobbled from the beginning of the file, i.e. absolutely not according to the spec, I thought: wow.
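For readers unfamiliar with the entry point mentioned above: a conforming reader starts from the end of the file, where the `startxref` keyword is followed by the byte offset of the cross-reference section and then `%%EOF`. A minimal sketch of locating that offset follows. `find_startxref` is a hypothetical helper, not part of any module discussed here, and the 1024-byte tail window is an assumption made for this example.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the byte offset that follows the last
# "startxref" keyword in the tail of the file, or undef if none is found.
sub find_startxref {
    my ($pdf_bytes) = @_;
    my $window = 1024;                          # assumed tail size to scan
    my $tail = length($pdf_bytes) > $window
        ? substr( $pdf_bytes, -$window )
        : $pdf_bytes;

    my $offset;
    # Incrementally updated files can contain several trailers; keep the last.
    $offset = $1 while $tail =~ /startxref\s+(\d+)\s+%%EOF/g;
    return $offset;
}
```

A reader would then seek to that offset and read the cross-reference data there, rather than scanning objects from the start of the file.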

And I've not even begun comparing these parsers' performance, sadly.

In reply to Re^3: pdf with too long lines by Anonymous Monk
in thread pdf with too long lines by Dirk80
