in reply to pdf with too long lines
pdftohtml -xml will give you the necessary data as very simple XML.
So simple that it's easily parsable by regex.
You'll see <text...> tags defining boxes enclosing lines.
Those have point coordinates for left and width. That's enough to identify pages with text outside page boundaries: a box sticks out when left + width exceeds the width given on the enclosing <page> tag.
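A minimal sketch of that check, in Python here, though the regexes should carry over to Perl more or less unchanged. It assumes the output has <page> tags carrying number and width attributes and <text> tags carrying left and width, which is what I'd expect from poppler's pdftohtml. The function name pages_with_overflow, the tolerance parameter and the sample data are made up for illustration.

```python
import re

# Regexes for the flat XML that "pdftohtml -xml" is assumed to emit.
PAGE_RE = re.compile(r'<page\b([^>]*)>(.*?)</page>', re.S)
TEXT_RE = re.compile(r'<text\b([^>]*)>')
ATTR_RE = re.compile(r'(\w+)="([^"]*)"')


def attrs(tag_body):
    """Turn the attribute part of a tag into a dict."""
    return dict(ATTR_RE.findall(tag_body))


def pages_with_overflow(xml, tolerance=0):
    """Return the numbers of pages holding a text box that crosses
    the left or right page edge (hypothetical helper)."""
    bad = []
    for page_attrs, body in PAGE_RE.findall(xml):
        page = attrs(page_attrs)
        page_width = float(page["width"])
        for text_attrs in TEXT_RE.findall(body):
            box = attrs(text_attrs)
            left = float(box["left"])
            right = left + float(box["width"])
            if left < -tolerance or right > page_width + tolerance:
                bad.append(int(page["number"]))
                break  # one offending box is enough for this page
    return bad


# Invented sample data, shaped like the assumed output.
SAMPLE = '''<pdf2xml>
<page number="1" position="absolute" top="0" left="0" height="1188" width="918">
<text top="100" left="50" width="400" height="16" font="0">fits</text>
</page>
<page number="2" position="absolute" top="0" left="0" height="1188" width="918">
<text top="100" left="600" width="400" height="16" font="0">sticks out</text>
</page>
</pdf2xml>'''

print(pages_with_overflow(SAMPLE))
```

For real input, read the file that pdftohtml -xml wrote and pass its contents in. The tolerance is there because a box may overshoot the edge by a unit or two without anything visibly sticking out.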
Compare: Parsing PDFs by text position?
Cheers Rolf
(addicted to the Perl Programming Language :)
see Wikisyntax for the Monastery
Re^2: pdf with too long lines
by Dirk80 (Pilgrim) on Aug 23, 2024 at 07:05 UTC
by LanX (Saint) on Aug 23, 2024 at 10:53 UTC