PDF search term highlighting

snellm has asked for the wisdom of the Perl Monks concerning the following question:

Has anybody got highlight in PDF files working, preferably in a UNIX/Perl environment?

The procedure for passing an XML file to the Acrobat Reader seems straightforward enough, but generating the XML file is tricky - I can't find any tools that are able to calculate page numbers and character offsets given a PDF file and a set of keywords.

-- Michael Snell
-- michael@snell.com

Re: PDF search term highlighting by traveler (Parson) on Nov 19, 2002 at 16:40 UTC
You should look at xpdf. It contains pdf2txt, which converts PDF to text. This is used by the Python tool pdfSearch, which seems to come close to what you want. HTH, --traveler
Re: Re: PDF search term highlighting by snellm (Monk) on Nov 19, 2002 at 17:01 UTC
I'm not sure this is useful - I already use pdf2txt in another context. The problem is that I need to know the page number and offset (i.e. the nth character) of the words to highlight. pdf2txt doesn't retain this information - it simply returns all the text in the PDF. -- Michael Snell -- michael@snell.com
Re: Re: Re: PDF search term highlighting by traveler (Parson) on Nov 19, 2002 at 17:44 UTC
I know that pdf2txt only outputs the text. Absent another solution, though, that code may form the basis for a Perl module you could write that would preserve the necessary information. --traveler
Re: PDF search term highlighting by TheHobbit (Pilgrim) on Nov 19, 2002 at 17:59 UTC
Hi, There are really a lot of modules to handle XML input/output... I think you might have a look at XML::Parser. Hoping this helps... Cheers Leo TheHobbit
Re: PDF search term highlighting by snellm (Monk) on Nov 22, 2002 at 12:18 UTC
Perhaps I didn't phrase the question correctly: I have no problem with XML per se - the problem is that I don't know how to find the page number and offset of a given keyword in a PDF file. -- Michael Snell -- michael@snell.com
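
One possible way to recover page numbers and offsets from plain extracted text is sketched below in Python, under some stated assumptions. It relies on xpdf's text extractor (installed as `pdftotext` on the command line) writing a form feed character between pages, so splitting on `\f` gives one string per page. The function names `extract_text` and `find_keyword_offsets` are hypothetical, pages are numbered from 0 here, and the offsets count characters in the extracted text. Whether those offsets line up with what Acrobat Reader expects in its highlight file has not been verified and may need adjusting for whitespace and line breaks.

```python
import re
import subprocess


def extract_text(pdf_path):
    """Run xpdf's pdftotext and return its output ("-" sends text to stdout).

    Assumption: pdftotext is on PATH and separates pages with form feeds.
    """
    result = subprocess.run(
        ["pdftotext", pdf_path, "-"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def find_keyword_offsets(text, keywords):
    """Return sorted (page, offset, length, keyword) tuples.

    page   : 0-based index of the form-feed-delimited page
    offset : 0-based character position of the match within that page
    Matching is case-insensitive and treats keywords as literal strings.
    """
    hits = []
    for page_no, page in enumerate(text.split("\f")):
        for kw in keywords:
            for m in re.finditer(re.escape(kw), page, flags=re.IGNORECASE):
                hits.append((page_no, m.start(), len(m.group()), kw))
    hits.sort()
    return hits


if __name__ == "__main__":
    # Stand-in for extract_text("some.pdf"): two pages separated by \f.
    sample = "hello world\fPerl and PDF\nperl again\f"
    for hit in find_keyword_offsets(sample, ["perl", "world"]):
        print(hit)
```

The resulting tuples carry the page and position information that a highlight file would need; translating them into that file's exact syntax is left out here, since the format details are not given in this thread.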