pdf with too long lines

Dirk80 has asked for the wisdom of the Perl Monks concerning the following question:

Re: pdf with too long lines by LanX (Saint) on Aug 22, 2024 at 20:08 UTC
`pdftohtml -xml` will give you the necessary data in a very simple XML. So simple it's easily parsable by regex. You'll see `<text...>` tags defining boxes enclosing lines. Those have point coordinates for `left` and `width`. That's enough to identify pages with text outside page boundaries. Compare: Parsing PDFs by text position?
Cheers Rolf
(addicted to the Perl Programming Language :) see Wikisyntax for the Monastery)
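For illustration, the regex approach described above could look roughly like the sketch below. It assumes the `<page>` element carries `number` and `width` attributes and that each `<text>` element carries `left` and `width`, which is how I remember the `pdftohtml -xml` output; check your own output before relying on it. The sub name `find_overflows`, the optional margin argument and the sample XML are made up for this example.

```perl
use strict;
use warnings;

# Return one record per <text> box whose right edge (left + width)
# lies beyond the page width minus an optional margin.
# Assumption: attribute names follow pdftohtml -xml output as described above.
sub find_overflows {
    my ($xml, $margin) = @_;
    $margin //= 0;
    my (@hits, $page, $limit);
    while ($xml =~ /<(page|text)\b([^>]*)>/g) {
        my ($tag, $attrs) = ($1, $2);
        # Collect name="value" pairs of the current tag into a hash.
        my %attr = $attrs =~ /(\w+)="([^"]*)"/g;
        if ($tag eq 'page') {
            $page  = $attr{number};
            $limit = $attr{width} - $margin;
            next;
        }
        next unless defined $limit;    # <text> outside any <page>: skip
        my $right = $attr{left} + $attr{width};
        push @hits, { page => $page, right => $right, limit => $limit }
            if $right > $limit;
    }
    return @hits;
}

# Invented sample data, shaped like the XML described above.
my $sample = <<'XML';
<page number="1" position="absolute" top="0" left="0" height="1263" width="892">
<text top="109" left="149" width="595" height="17" font="0">fits on the page</text>
<text top="130" left="149" width="800" height="17" font="0">runs off the right edge</text>
</page>
<page number="2" position="absolute" top="0" left="0" height="1263" width="892">
<text top="109" left="149" width="700" height="17" font="0">fits as well</text>
</page>
XML

printf "page %s: box ends at %d, limit is %d\n", @{$_}{qw(page right limit)}
    for find_overflows($sample);
```

With the sample above, only the second box on page 1 is reported. Passing a margin as second argument, e.g. `find_overflows($sample, 50)`, also flags boxes that end too close to the right edge. For real files a proper XML parser is more robust than a regex, but for output this simple the regex is usually enough.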
Re^2: pdf with too long lines by Dirk80 (Pilgrim) on Aug 23, 2024 at 07:05 UTC
Thank you very much. This should be a good way to do it. I tried it, but I got an error message saying that my PDF file is version 1.6 and pdftohtml only supports up to version 1.5. But I could downgrade the file and then use the tool. Everything worked fine. I'm happy with the solution.
Re^3: pdf with too long lines by LanX (Saint) on Aug 23, 2024 at 10:53 UTC
Happy it worked :)
> pdftohtml only supports until version 1.5.
That surprises me. I remember looking into the C code, and AFAIR it's just sending the PDF through ghostscript and parsing the plot output. Not sure why ghostscript should fail now. 🤷🏻‍♂️
Cheers Rolf
Re: pdf with too long lines (OT) by LanX (Saint) on Aug 23, 2024 at 13:36 UTC
I know you already have your solution, but two more comments:
> Often text lines go beyond the page limit of the A4 format at the right side. So the text is there but not visible.
(Kind of off topic here :) I'm not sure if you're talking about print or online viewing, but this sounds like the document was produced for another paper size like `letter` ¹. So rescaling "to fit" or changing the page format on the creator level should help here. Maybe try this inside your "Pro Acrobat" tool? ²
> Or much better could I automatically minimize the right border of all text boxes to be in the visible area plus a small margin?
Hmm, this depends on the level of semantic info available. Many tools only have positional information and lack your/artificial "intelligence" to tell what should be resized and what not. For instance, you don't want your page numbers and headings to jump because the main body is reflowed. Even with an elaborate CPAN module you'd need a lot of trial and error.
Cheers Rolf
Updates
¹) PS: metric rules! =D
²) there are plenty of online services offering to resize PDFs
Re: pdf with too long lines by perlfan (Parson) on Aug 24, 2024 at 20:21 UTC
The talk "Direct Access to PDF Internals with PDF::Data" (Deven Corzine, TPRC 2024) is recent and may give you some confidence in some modules over others. The module presented is PDF::Data.
Re^2: pdf with too long lines by Anonymous Monk on Aug 25, 2024 at 09:06 UTC
I believe the majority of humanity (i.e. amongst PC users) is both east of Greenwich and on Windows. For them, the very innocent `PDF::Data->new()` produces a head-scratching `'Use of uninitialized value in division (/)'` warning, because the author (very nice guy, judging by the video) did `mktime(gmtime 0)`. Oh well, this is so harmless a warning compared to far more grave problems.
Re^3: pdf with too long lines by Anonymous Monk on Aug 25, 2024 at 11:31 UTC
And by "grave", I mean: was it even tested with anything beyond the "hello world" level, given that it is already past version "1" and has made its way to the TPRC, no less?
Look, I don't want to be rude. I tried it with something slightly above kindergarten complexity, such as "HigherOrderPerl-trimmed.pdf" (and the untrimmed one) or "modern_perl_ebook.pdf" (the latter converted to version 1.4 as explained in the video), and finally the very simple "What Every Computer Scientist Should Know About Floating-Point Arithmetic" PDF from the Oracle site. These are just off the top of my head; I'm naming files anyone can google for a test and that are surely familiar to the audience. This module failed to parse all of them. It boiled down to, e.g.:
`PDF::Data->new->parse_data( '[([)]' );`
which gives:
`Byte offset 0: Parse error on input: "[([)]"`
That is because a bracket is not expected inside a string that is an element of an array. I'm not saying all the files above failed because of the regex for parsing nested arrays. I suspect there are similar issues all over. Compare:
`use CAM::PDF; use PDF::API2; use Data::Dumper::Concise; print Dumper( CAM::PDF->parseAny( \'[([)]' )); print Dumper( PDF::API2::Basic::PDF::File->new->readval( "[([)]\n" ));`
Output:
`bless( { type => "array", value => [ bless( { type => "string", value => "[", }, 'CAM::PDF::Node' ), ], }, 'CAM::PDF::Node' )`
`bless( { " realised" => 1, " val" => [ bless( { " realised" => 1, val => "[", }, 'PDF::API2::Basic::PDF::String' ), ], }, 'PDF::API2::Basic::PDF::Array' )`
(The fact that one of them expects a string reference, and the other a NL appended for this stand-alone example, is not relevant.)
In fact, reading the code (but not the documentation), I see the constructor can take a "-novalidate" parameter, which turns off parsing of content streams (as opposed to the PDF structure), which is generally not required anyway.
And then the most terrible déjà vu comes over me -- I already tried this module several years ago, and already found that I had to turn this flag off to even begin playing with it. Then, having discovered that the entry point (startxref offset) is simply ignored (line #185 comment) and data are just gobbled from the beginning of the file, i.e. absolutely not according to the spec, I thought: wow. And I've not even begun comparing these parsers' performance, sadly.
Re^4: pdf with too long lines by Anonymous Monk on Aug 27, 2024 at 01:52 UTC
Re^5: pdf with too long lines by Anonymous Monk on Aug 27, 2024 at 08:43 UTC
Re^4: pdf with too long lines by Anonymous Monk on Aug 26, 2024 at 09:28 UTC
Re^3: pdf with too long lines by perlfan (Parson) on Aug 27, 2024 at 23:15 UTC
WTH - I think the hackernews and reddit bots have found their way here.
