Dirk80 has asked for the wisdom of the Perl Monks concerning the following question:

Hello, I have a pdf file consisting of several hundred pages. Often text lines go beyond the page limit of the A4 format at the right side. So the text is there but not visible.

When I edit the pdf file in Adobe Acrobat Pro, I can see that the text is in a frame that goes beyond the limit. When I manually make this frame or box smaller, i.e. pull its right edge into the visible area, then the text inside is completely in the visible area.

It's too much work to do this manually and to find all places. The font size of the text can be different and often the text begins somewhere in the middle.

Is there a way to find out with a Perl script where all the text boxes are that go beyond the page limit? Or, much better, could I automatically shrink the right border of all text boxes so that they fit in the visible area plus a small margin?

There are a lot of pdf libraries on CPAN. But I don't really know how to begin here. Thank you very much for your input!

Replies are listed 'Best First'.
Re: pdf with too long lines
by LanX (Saint) on Aug 22, 2024 at 20:08 UTC

    pdftohtml -xml will give you the necessary data in a very simple XML.

    So simple it's easily parsable by regex.

    You'll see <text...> tags defining boxes enclosing lines.

    Those have point coordinates for left and width. That's enough to identify pages with text outside the page boundaries.

    Compare Parsing PDFs by text position?
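
    A minimal sketch of that idea, assuming the usual pdftohtml -xml layout with one <page ... width="..."> tag per page and one <text ... left="..." width="..."> tag per line. The function name find_overflows, the margin parameter and the sample data are made up for illustration:

```perl
use strict;
use warnings;

# Return one hash per <text> box whose right edge (left + width) lies
# beyond the page width minus an optional margin (same units as the XML).
sub find_overflows {
    my ($xml, $margin) = @_;
    $margin //= 0;
    my (@hits, $page, $page_width);
    for my $line (split /\n/, $xml) {
        if ($line =~ /<page\b([^>]*)>/) {
            my $attrs = $1;
            ($page)       = $attrs =~ /\bnumber="(\d+)"/;
            ($page_width) = $attrs =~ /\bwidth="([\d.]+)"/;
            next;
        }
        next unless defined $page_width;
        next unless $line =~ m{<text\b([^>]*)>(.*?)</text>};
        my ($attrs, $content) = ($1, $2);
        my ($left)  = $attrs =~ /\bleft="(-?[\d.]+)"/;
        my ($width) = $attrs =~ /\bwidth="(-?[\d.]+)"/;
        next unless defined $left && defined $width;
        my $right = $left + $width;
        next unless $right > $page_width - $margin;
        $content =~ s/<[^>]+>//g;    # drop inline tags such as <b>
        push @hits, { page => $page, right => $right, text => $content };
    }
    return @hits;
}

# Made-up sample in the shape of pdftohtml -xml output.
my $sample = <<'XML';
<page number="1" position="absolute" top="0" left="0" height="1263" width="892">
<text top="100" left="80" width="700" height="16" font="0">fits on the page</text>
<text top="130" left="400" width="600" height="16" font="0">runs <b>past</b> the right edge</text>
</page>
XML

for my $hit (find_overflows($sample, 20)) {
    printf "page %s: right edge at %s: %s\n", @{$hit}{qw(page right text)};
}
```

    For a real document you would read the XML file written by pdftohtml instead of the inline sample and pick a margin that suits the layout.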

    Cheers Rolf
    (addicted to the Perl Programming Language :)
    see Wikisyntax for the Monastery

      Thank you very much. This looks like a good way to do it.

      I tried it, but I got an error message saying that my pdf file is version 1.6 and pdftohtml only supports up to version 1.5.

      But I could downgrade it and then use the tool. Everything worked fine. I'm happy with the solution.

        Happy it worked :)

        > pdftohtml only supports up to version 1.5.

        That surprises me. I remember looking into the C code, and AFAIR it just sends the PDF through ghostscript and parses the plot output.

        Not sure why ghostscript should fault now. 🤷🏻‍♂️

        Cheers Rolf

Re: pdf with too long lines (OT)
by LanX (Saint) on Aug 23, 2024 at 13:36 UTC
    I know you already have your solution, but two more comments:

    > Often text lines go beyond the page limit of the A4 format at the right side. So the text is there but not visible.

    (Kind of off topic here :)

    I'm not sure if you're talking about print or online viewing, but this sounds like the document was produced for another paper size like letter ¹.

    So rescaling "to fit" or changing the page format on the creator level should help here.
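
    To put rough numbers on the Letter vs. A4 guess: Letter is 612 x 792 pt and A4 is about 595.28 x 841.89 pt, so Letter is roughly 17 pt (about 6 mm) wider. A small sketch of the "scale to fit" arithmetic follows. The names %SIZE_PT and fit_scale are made up for illustration, and nothing here touches a PDF:

```perl
use strict;
use warnings;

# Page sizes in PostScript points (1 pt = 1/72 inch).
my %SIZE_PT = (
    A4     => [ 595.28, 841.89 ],    # 210 mm x 297 mm
    Letter => [ 612,    792    ],    # 8.5 in x 11 in
);

# Largest uniform scale factor at which a page of size $from
# still fits completely on a page of size $to.
sub fit_scale {
    my ($from, $to) = @_;
    my ($fw, $fh) = @{ $SIZE_PT{$from} };
    my ($tw, $th) = @{ $SIZE_PT{$to} };
    my $by_width  = $tw / $fw;
    my $by_height = $th / $fh;
    return $by_width < $by_height ? $by_width : $by_height;
}

printf "Letter is %.2f pt wider than A4\n",
    $SIZE_PT{Letter}[0] - $SIZE_PT{A4}[0];
printf "Letter -> A4 scale to fit: %.4f\n", fit_scale('Letter', 'A4');
```

    If the overflowing text sits within the last few millimetres of the page, a scale factor like this (or simply switching the paper size) may be all that is needed.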

    Maybe try this inside your Acrobat Pro tool? ²

    > Or much better could I automatically minimize the right border of all text boxes to be in the visible area plus a small margin?

    Hmm, this depends on the level of semantic info available. Many tools only have positional information and lack your (or an artificial) intelligence to tell what should be resized and what not.

    For instance, you don't want your page numbers and headings to shift because the main body is reflowed.

    Even with an elaborate CPAN module you'd need a lot of trial and error.

    Cheers Rolf

    Updates

    ¹) PS: metric rules! =D

    ²) there are plenty of online services offering to resize PDF

Re: pdf with too long lines
by perlfan (Parson) on Aug 24, 2024 at 20:21 UTC

      I believe the majority of humanity (i.e. among PC users) is both east of Greenwich and on Windows. For them, the very innocent PDF::Data->new() produces a head-scratching 'Use of uninitialized value in division (/)' warning, because the author (very nice guy, judging by the video) did mktime(gmtime 0). Oh well, this is a harmless warning compared to far graver problems.
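
      A sketch of why that call is fragile, under two assumptions of mine: that the module wants the local UTC offset, and that the warning comes from mktime returning undef for a local time that falls before the epoch. POSIX::mktime treats its fields as local time, so east of Greenwich the result would be negative. The function utc_offset_seconds is a made-up name, and this is one common alternative rather than the module's code:

```perl
use strict;
use warnings;
use POSIX qw(mktime);
use Time::Local qw(timegm);

# gmtime(0) gives the fields for 1970-01-01 00:00:00, but mktime() reads
# them as LOCAL time. Depending on platform and time zone this may be
# undef, so check before doing arithmetic with it.
my $fragile = mktime(gmtime 0);
printf "mktime(gmtime 0): %s\n", defined $fragile ? $fragile : 'undef';

# Offset of local time from UTC in seconds (positive east of Greenwich),
# computed from the current time instead of the epoch.
sub utc_offset_seconds {
    my $now = shift // time;
    return timegm(localtime $now) - $now;
}

printf "local UTC offset: %+d seconds\n", utc_offset_seconds();
```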

        And by "grave", I mean: was it even tested with anything beyond "hello world" level, given that it is already above version "1" and has made its way to the TPRC, no less? Look, I don't want to be rude. I tried it with something slightly above kindergarten complexity, such as "HigherOrderPerl-trimmed.pdf" (and not "trimmed") or "modern_perl_ebook.pdf" (the latter converted to version 1.4 as explained in the video), then at last the very simple "What Every Computer Scientist Should Know About Floating-Point Arithmetic" PDF from the Oracle site. These are just off the top of my head: files that anyone can google for a test and that are surely familiar to the audience.

        This module failed parsing all of them.

        Boiled down to, e.g.:

        PDF::Data->new->parse_data( '[([)]' );
        Byte offset 0: Parse error on input: "[([)]"

        Because a bracket is not expected inside a string that is an element of an array. I'm not saying all the files above failed because of the nested-array parsing regex. I suspect there are similar issues all over.

        Compare:

        use CAM::PDF;
        use PDF::API2;
        use Data::Dumper::Concise;

        print Dumper( CAM::PDF->parseAny( \'[([)]' ));
        print Dumper( PDF::API2::Basic::PDF::File->new->readval( "[([)]\n" ));

        bless( {
          type => "array",
          value => [
            bless( {
              type => "string",
              value => "[",
            }, 'CAM::PDF::Node' ),
          ],
        }, 'CAM::PDF::Node' )
        bless( {
          " realised" => 1,
          " val" => [
            bless( {
              " realised" => 1,
              val => "[",
            }, 'PDF::API2::Basic::PDF::String' ),
          ],
        }, 'PDF::API2::Basic::PDF::Array' )

        (the fact that one of them expects a string reference, and the other a NL appended, is not relevant for this stand-alone example)
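
        To illustrate the point about brackets inside strings, here is a deliberately tiny hand-rolled parser. It is not taken from any of the modules above. It handles only arrays and literal strings: once inside "(...)" it tracks balanced parentheses and backslash escapes, and "[" or "]" are ordinary characters. The names parse_value and parse_string are made up, and the escape handling is simplified (no \n, no octal codes):

```perl
use strict;
use warnings;

# Parse one value (an array or a literal string) starting at pos() of the
# string referenced by $in. Returns an array ref or a plain string.
sub parse_value {
    my ($in) = @_;
    $$in =~ /\G\s+/gc;                         # skip leading whitespace
    if ($$in =~ /\G\[/gc) {
        my @items;
        until ($$in =~ /\G\s*\]/gc) {
            die "unterminated array\n" if $$in =~ /\G\s*\z/gc;
            push @items, parse_value($in);
        }
        return \@items;
    }
    return parse_string($in) if $$in =~ /\G\(/gc;
    die sprintf "unexpected input at offset %d\n", pos($$in) // 0;
}

# Called just after the opening "(". Parentheses may nest if balanced;
# a backslash makes the next character literal (simplified).
sub parse_string {
    my ($in) = @_;
    my ($depth, $out) = (1, '');
    while ($$in =~ /\G(\\.|[()]|[^\\()]+)/gcs) {
        my $tok = $1;
        if    ($tok eq '(')        { $depth++ }
        elsif ($tok eq ')')        { return $out if --$depth == 0 }
        elsif ($tok =~ /^\\(.)/s)  { $tok = $1 }
        $out .= $tok;
    }
    die "unterminated string\n";
}

my $input  = '[([)]';
my $parsed = parse_value(\$input);
printf "%d item(s), first is '%s'\n", scalar @$parsed, $parsed->[0];
```

        The design choice that matters is that the string branch consumes everything up to its matching ")" before the array branch gets to look for "]" again.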

        In fact, reading the code (but not the documentation), there is a constructor parameter "-novalidate", which turns off parsing of content streams (as opposed to the PDF structure), which is generally not required anyway. And then the most terrible deja-vu hits me -- I have already tried this module several years ago, and already found that I had to turn this flag off to even begin playing with it. Then, having discovered that the entry point (startxref offset) is simply ignored (line #185 comment) and data are just gobbled from the beginning of the file, i.e. absolutely not according to the spec, I thought wow.

        And I've not even begun comparing these parsers' performance, sadly.

        WTH - I think the hackernews and reddit bots have found their way here.