in reply to pdf with too long lines
Re^2: pdf with too long lines
by Anonymous Monk on Aug 25, 2024 at 09:06 UTC
I believe the majority of humanity (i.e. amongst PC users) are both east of Greenwich and on Windows. For them, the very innocent PDF::Data->new() produces a head-scratching 'Use of uninitialized value in division (/)' warning, because the author (a very nice guy, judging by the video) did mktime(gmtime 0). East of Greenwich that is a local time before the epoch, which mktime on Windows cannot represent, so the result is undefined. Oh well, this warning is harmless compared to far graver problems.
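A minimal sketch of the pitfall and one way around it, assuming the value is wanted as a time zone offset (that purpose is a guess, not something stated above). The helper name `tz_offset_seconds` is made up for this example; it derives the offset from a timestamp well after the epoch, so mktime never has to handle a negative time.

```perl
use strict;
use warnings;
use POSIX qw(mktime);
use Time::Local qw(timegm);

# Hypothetical helper: local-time offset from UTC in seconds, computed
# from a timestamp (default: now) rather than from the epoch itself.
sub tz_offset_seconds {
    my ($t) = @_;
    $t = time unless defined $t;
    # Read the local broken-down time as if it were UTC; the difference
    # from the real timestamp is the offset (positive east of Greenwich).
    return timegm((localtime $t)[0 .. 5]) - $t;
}

# The fragile probe: for zones east of Greenwich, local midnight on
# 1970-01-01 lies before time 0, and mktime may then return undef.
my $probe = mktime(gmtime 0);
printf "mktime(gmtime 0): %s\n", defined $probe ? $probe : 'undef';
printf "offset via timegm/localtime: %d s\n", tz_offset_seconds();
```

Checking `defined` before dividing is enough to silence the warning; computing the offset from the current time avoids the failing call altogether.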
by Anonymous Monk on Aug 25, 2024 at 11:31 UTC
And by "grave", I mean: was it even tested with anything beyond "hello world" level, being already at a version above "1" and having made its way to the TPRC, no less? Look, I don't want to be rude. I tried it with something slightly above kindergarten complexity, such as "HigherOrderPerl-trimmed.pdf" (and not "trimmed") or "modern_perl_ebook.pdf" (the latter converted to version 1.4 as explained in the video), and at last with the very simple "What Every Computer Scientist Should Know About Floating-Point Arithmetic" PDF from the Oracle site. Just off the top of my head, I'm naming files that anyone can google for a test and that are surely familiar to the audience. This module failed to parse all of them. Boiled down to, e.g.:
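A hypothetical stand-alone illustration of this class of failure. It is not the module's actual regex, only a naive pattern of the same kind next to a string-aware one. The names `naive_array_body` and `string_aware_array` are made up for the sketch, and hex strings and comments are deliberately ignored.

```perl
use strict;
use warnings;

# Naive pattern: an array is "[", anything but brackets, then "]".
# It knows nothing about literal strings, so a "]" inside (...) ends
# the match early.
sub naive_array_body {
    my ($src) = @_;
    return $src =~ /\[([^\[\]]*)\]/ ? $1 : undef;
}

# String-aware pattern (recursive regex, Perl 5.10+): literal strings
# with balanced parentheses and backslash escapes are consumed as a
# unit, and nested arrays recurse.
my $ARRAY = qr{
    (?<arr> \[ (?: [^\[\]()]++ | (?&str) | (?&arr) )* \] )
    (?(DEFINE)
        (?<str> \( (?: [^()\\]++ | \\. | (?&str) )* \) )
    )
}xs;

sub string_aware_array {
    my ($src) = @_;
    return $src =~ $ARRAY ? $+{arr} : undef;
}

my $input = '[ (a]b) /Name 42 ]';
printf "naive:        <%s>\n", naive_array_body($input);    # < (a>
printf "string-aware: <%s>\n", string_aware_array($input);  # whole array
```

The naive version stops at the bracket inside the string, while the second returns the complete array.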
Because a bracket is not expected inside a string that is an element of an array. I'm not saying all the files above failed because of the nested-array parsing regex; I suspect there are similar issues all over. Compare:
(The fact that one of them expects a string reference, and the other needs a NL attached for this stand-alone example, is not relevant.) In fact, reading the code (but not the documentation), the constructor accepts a "-novalidate" parameter, which turns off parsing of content streams (as opposed to PDF structure), which is generally not required anyway. And then the most terrible deja vu came over me: I had already tried this module several years ago, and had already found that I had to turn validation off with this flag to even begin playing with it. Then, having discovered that the entry point (the startxref offset) is simply ignored (see the comment at line #185) and data are just gobbled from the beginning of the file, i.e. absolutely not according to the spec, I thought: wow. And I've not even begun comparing these parsers' performance, sadly.
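For reference, a minimal sketch of the spec-style entry point: find the last startxref keyword near the end of the file and jump to the offset it names. The function name `find_startxref`, the 1024-byte tail window and the toy document are assumptions made for this example, not code from any module discussed here.

```perl
use strict;
use warnings;

# Return the byte offset named by the last "startxref" keyword, or undef.
sub find_startxref {
    my ($pdf_bytes) = @_;
    # Only the tail is inspected; 1024 bytes is a heuristic window.
    my $tail = length($pdf_bytes) > 1024
        ? substr($pdf_bytes, -1024)
        : $pdf_bytes;
    my $offset;
    # Keep the last match, since incremental updates append new trailers.
    $offset = $1 while $tail =~ /startxref\s+(\d+)\s+%%EOF/g;
    return $offset;
}

# Toy document: just enough structure to exercise the lookup; it is
# not a valid PDF.
my $body     = "%PDF-1.4\n1 0 obj\n<< /Type /Catalog >>\nendobj\n";
my $xref_pos = length $body;
my $pdf      = $body
    . "xref\n0 1\n0000000000 65535 f \n"
    . "trailer\n<< /Size 1 >>\nstartxref\n$xref_pos\n%%EOF\n";

my $offset = find_startxref($pdf);
print "keyword at offset $offset: ", substr($pdf, $offset, 4), "\n";
```

A real reader would go on to parse the cross-reference data at that offset and resolve objects through it, instead of scanning the file from the top.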
by Anonymous Monk on Aug 27, 2024 at 01:52 UTC
by Anonymous Monk on Aug 27, 2024 at 08:43 UTC
by Anonymous Monk on Aug 26, 2024 at 09:28 UTC
"comparing these parsers performance" Shall we?
Modules were created for different tasks/purposes; this comparison is not really practical, just entertainment and additional proof to myself (though I don't need any) that one parser is superior to the alternatives, and why I use it and won't consider others for my usual purposes, which are inspection/analysis and minor pinpoint changes. If/when I need PDF generation (especially from scratch), there's no choice but PDF::API2. Just to clarify: (a) PDF::API2 caches all pages to its internal stack on open, therefore I did an additional step for CAM::PDF, and so (1) + (2) for the latter is to be compared to (1) of the former. (b) PDF::Data also decompresses/inflates all streams (though we have prohibited validation), which takes ~0.5 seconds. That could either be disabled (by patching the source) or turned on for the other participants, but I don't think it's important. The comparison is an illustration of how efficiently regexes are used to parse.
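A rough sketch of how such staged timings can be taken, so that "(1) + (2)" for one module can be set against "(1)" for another. The helper `time_stages` and the stand-in workloads are hypothetical; in a real run each code reference would wrap the corresponding parser call.

```perl
use strict;
use warnings;
use Time::HiRes qw(time);

# Run named stages in order and return [name, seconds] pairs.
sub time_stages {
    my @stages = @_;    # flat list: name => coderef, ...
    my @report;
    while (my ($name, $code) = splice @stages, 0, 2) {
        my $t0 = time;
        $code->();
        push @report, [ $name, time - $t0 ];
    }
    return @report;
}

# Stand-in workloads; replace them with the real open and page-caching
# calls of the parser under test.
my @report = time_stages(
    'open'        => sub { my $x = 0; $x += $_ for 1 .. 100_000 },
    'cache pages' => sub { my @a = map { $_ * 2 } 1 .. 100_000 },
);

my $total = 0;
for my $row (@report) {
    printf "%-12s %.4f s\n", @$row;
    $total += $row->[1];
}
printf "%-12s %.4f s\n", 'total', $total;
```

Timing the stages separately keeps the comparison honest when one module does on open what another defers to a later call.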
by perlfan (Parson) on Aug 27, 2024 at 23:15 UTC