in reply to PDF::API2 to search for text and place hyperlinks in PDF file

There must be a better way.
Not as far as I know. If you need to find not only the text but also the actual position of the label containing it, you're going to have to parse the entire structure. I'm not sure that doing that with Data::Dumper, rather than with CAM::PDF or PDF::API2's internal methods, is a good idea, but no matter how you slice it, you basically have to mimic the rendering process (parsing the page tree) to get the actual page positions of the text.

And even then, if you are searching for a substring, you'll only have the position of the text container, not the position of the substring itself. Getting the position of the substring would require actually rendering the PDF, complete with its fonts.

Re^2: PDF::API2 to search for text in PDF file
by knbknb (Acolyte) on Mar 25, 2009 at 16:47 UTC
    Data::Dumper was only my first quick and dirty solution. I noticed that its output contains a lot of noise, but in places there are lines like

    #(250.00, 650.00) This is some pdftext

    The numbers in parentheses are presumably the position of the text on the page. With some assumptions (the font size is always 10-12, and the box width can also be constrained or guessed meaningfully), I could put a hyperlink there, which would be at approximately the right position.
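
To make the guessing concrete, here is a rough sketch in Python (not Perl, and not tied to any PDF library). It assumes the dumped pair is the start of the text baseline in PDF points, that an average glyph is about half the font size wide, and that about a fifth of the font size hangs below the baseline. The function name `estimate_link_rect` and both factors are made up for illustration and would need tuning against real documents.

```python
def estimate_link_rect(x, y, text, font_size=11.0,
                       char_width_factor=0.5, descent_factor=0.2):
    """Guess a clickable rectangle (x1, y1, x2, y2) for a run of text.

    Assumptions (not taken from any PDF library):
      * (x, y) is the left end of the text baseline, in PDF points
      * an average glyph is char_width_factor * font_size wide
      * descenders reach descent_factor * font_size below the baseline
    """
    width = len(text) * font_size * char_width_factor
    bottom = y - font_size * descent_factor
    top = y + font_size * (1 - descent_factor)
    return (x, bottom, x + width, top)


if __name__ == "__main__":
    # Position and text taken from the dump line quoted above.
    print(estimate_link_rect(250.00, 650.00, "This is some pdftext"))
```

The rectangle will only be approximately right, since real glyph widths depend on the font, so some manual correction afterwards is still to be expected.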

    Afterwards, manual editing could remove or change the URL. This would still be much quicker than setting all the URLs manually from scratch. I would also happily switch to a different tool that can dump text and position to a text file. For instance, we have Acrobat 8 here, but I haven't tried its JavaScript API. A table of

    ### page ### position x,y ### matched text ####

    would suffice for a while. I could use this as input for my script.
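
For illustration, a minimal Python sketch of how such a table could be built from the dump output. It assumes every interesting line looks like the `#(250.00, 650.00) This is some pdftext` example above, and that the page number is known to the caller, since the dump line itself does not carry it. The names `extract_rows` and `format_row` are hypothetical.

```python
import re

# Matches dump lines such as "#(250.00, 650.00) This is some pdftext".
POSITION_LINE = re.compile(
    r"^\s*#\(\s*(-?\d+(?:\.\d+)?)\s*,\s*(-?\d+(?:\.\d+)?)\s*\)\s*(.*\S)\s*$"
)


def extract_rows(dump_lines, page, pattern):
    """Return (page, x, y, text) tuples for dump lines whose text matches pattern."""
    wanted = re.compile(pattern)
    rows = []
    for line in dump_lines:
        match = POSITION_LINE.match(line)
        if not match:
            continue  # not a position line, skip the rest of the dump noise
        x, y, text = float(match.group(1)), float(match.group(2)), match.group(3)
        if wanted.search(text):
            rows.append((page, x, y, text))
    return rows


def format_row(row):
    """Render one row in the '### page ### x,y ### text ####' layout."""
    page, x, y, text = row
    return f"### {page} ### {x:.2f},{y:.2f} ### {text} ####"


if __name__ == "__main__":
    sample = ["$VAR1 = {", "    #(250.00, 650.00) This is some pdftext", "};"]
    for row in extract_rows(sample, page=1, pattern=r"pdftext"):
        print(format_row(row))
```

Text that wraps across lines would show up as separate rows here, so a match that spans a line break would still be missed.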

    I still don't know what to do with text that wraps around on the page, though. These hyperlinks would be incomplete and hence invalid.