in reply to pdf with too long lines

pdftohtml -xml will give you the necessary data in a very simple XML.

So simple it's easily parsable by regex.

You'll see <text...> tags defining boxes enclosing lines.

Those have point coordinates for left and width. That's enough to identify pages with text outside the page boundaries.
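
A rough sketch of that check, in Python purely for illustration (the regexes port directly to Perl). It assumes each `<page ...>` tag carries `number` and `width` attributes and each `<text ...>` tag carries integer `left` and `width` attributes in the same coordinate system; verify that against your own output. The function name `overflowing_pages` and the `SAMPLE` data are made up for this example.

```python
import re

# Assumed shape of the pdftohtml -xml output: <page ...> elements wrapping
# <text ...> elements, with double-quoted attributes.
PAGE_RE = re.compile(r'<page\b([^>]*)>(.*?)</page>', re.S)
TEXT_RE = re.compile(r'<text\b([^>]*)>')
ATTR_RE = re.compile(r'(\w+)="([^"]*)"')


def attrs(raw):
    """Turn the attribute part of a tag into a dict of strings."""
    return dict(ATTR_RE.findall(raw))


def overflowing_pages(xml):
    """Return the sorted page numbers that contain a text box
    starting left of the page or ending right of its width."""
    bad = set()
    for page_raw, body in PAGE_RE.findall(xml):
        page = attrs(page_raw)
        page_width = int(page["width"])
        for text_raw in TEXT_RE.findall(body):
            box = attrs(text_raw)
            left, width = int(box["left"]), int(box["width"])
            if left < 0 or left + width > page_width:
                bad.add(int(page["number"]))
                break  # one offending line is enough for this page
    return sorted(bad)


# Hypothetical two-page sample: the line on page 2 runs past the right edge.
SAMPLE = """
<page number="1" position="absolute" top="0" left="0" height="1188" width="918">
<text top="100" left="100" width="700" height="17" font="0">fits</text>
</page>
<page number="2" position="absolute" top="0" left="0" height="1188" width="918">
<text top="100" left="100" width="900" height="17" font="0">too long</text>
</page>
"""

if __name__ == "__main__":
    print(overflowing_pages(SAMPLE))
```

Because page and text boxes are compared in the same units, the check does not depend on the scale of the coordinates. To test against a margin rather than the page edge, compare with a smaller limit than the page width.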

Compare "Parsing PDFs by text position?"

Cheers Rolf
(addicted to the Perl Programming Language :)
see Wikisyntax for the Monastery

Re^2: pdf with too long lines
by Dirk80 (Pilgrim) on Aug 23, 2024 at 07:05 UTC

    Thank you very much. This looks like a good way to do it.

    I tried it, but I got an error message saying that my PDF file is version 1.6 and pdftohtml only supports until version 1.5.

    But I was able to downgrade the file and then use the tool. Everything worked fine. I'm happy with the solution.

      Happy it worked :)

      > pdftohtml only supports until version 1.5.

      That surprises me. I remember looking into the C code, and AFAIR it just sends the PDF through Ghostscript and parses the plot output.

      Not sure why Ghostscript should fail now. 🤷🏻‍♂️

      Cheers Rolf