Re: pdf with too long lines

pdftohtml -xml will give you the necessary data in a very simple XML.

So simple it's easily parsable by regex.

You'll see <text...> tags defining boxes enclosing lines.

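Those have point coordinates for left and width, and the enclosing <page...> tag carries the page's own width. Comparing left + width against the page width is enough to identify pages with text outside page boundaries.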
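A minimal sketch of that check in Perl, using only regexes. The attribute names (number, left, width) and the layout of the sample are assumptions about what pdftohtml -xml emits, so compare them against your own output first. The sub name overflowing_pages is made up for this example.

```perl
use strict;
use warnings;

# Return the numbers of all pages on which at least one <text> box
# extends past the left or right edge of its <page> box.
# ASSUMPTION: <page> has number/left/width attributes and <text> has
# left/width attributes, all in the same units.
sub overflowing_pages {
    my ($xml) = @_;
    my @bad;

    # Handle each <page ...> ... </page> element separately.
    while ( $xml =~ m{<page\b([^>]*)>(.*?)</page>}gs ) {
        my ( $page_attrs, $body ) = ( $1, $2 );
        my %page       = $page_attrs =~ /(\w+)="([^"]*)"/g;
        my $page_left  = $page{left} // 0;
        my $page_right = $page_left + $page{width};

        # Look at the opening tag of every text box on this page.
        while ( $body =~ /<text\b([^>]*)>/g ) {
            my $text_attrs = $1;
            my %text       = $text_attrs =~ /(\w+)="([^"]*)"/g;
            if (   $text{left} < $page_left
                or $text{left} + $text{width} > $page_right )
            {
                push @bad, $page{number};
                last;    # one offending box is enough for this page
            }
        }
    }
    return @bad;
}

# Hand-written sample in the assumed format, not real tool output.
my $sample = <<'XML';
<page number="1" position="absolute" top="0" left="0" height="1188" width="918">
<text top="100" left="108" width="500" height="17" font="0">fits</text>
</page>
<page number="2" position="absolute" top="0" left="0" height="1188" width="918">
<text top="100" left="108" width="900" height="17" font="0">a line that is much too long</text>
</page>
XML

print "Text outside page boundaries on page $_\n"
    for overflowing_pages($sample);
```

With the sample above only page 2 is reported, since 108 + 900 exceeds the page width of 918. For a real file you would read the XML that pdftohtml wrote into a string and pass that in instead of $sample.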

Cheers Rolf
_{(addicted to the Perl Programming Language :)

see Wikisyntax for the Monastery}

Re^2: pdf with too long lines by Dirk80 (Pilgrim) on Aug 23, 2024 at 07:05 UTC
Thank you very much. This looks like a good way to do it. When I tried it, I got an error message saying that my PDF file is version 1.6 and pdftohtml only supports up to version 1.5. But I could downgrade the file and then use the tool. Everything worked fine. I'm happy with the solution.
Re^3: pdf with too long lines by LanX (Saint) on Aug 23, 2024 at 10:53 UTC
Happy it worked :)

> pdftohtml only supports until version 1.5.

That surprises me. I remember looking into the C code, and AFAIR it just sends the PDF through ghostscript and parses the plot output. Not sure why ghostscript should fault now. 🤷🏻‍♂️

Cheers Rolf