Extracting text from PDFs

Cody Pendant has asked for the wisdom of the Perl Monks concerning the following question:

I've had some success in the past extracting PDFs to HTML using the "pdftohtml" command-line utility.

With the PDFs I'm looking at now ... not so much.

What can I do to get started parsing PDFs with Perl and extracting the text from them in any usable way? The whole PDF::* hierarchy on CPAN is a bit of a mystery to me.

($_='kkvvttuu bbooppuuiiffss qqffssmm iibbddllffss')
=~y~b-v~a-z~s; print

Re: Extracting text from PDFs by GrandFather (Saint) on Jan 10, 2007 at 00:25 UTC
This has come up from time to time in the past. Super Search is your friend. A search on "pdf extract text" turns up "extract text from pdf", "Converting PDF to plain text", "how to extract text from PDF", "Text from PDF" and "Can I convert a pdf to html with PDF::Extract??", many of which have replies suggesting the OP use Super Search. :D

DWIM is Perl's answer to Gödel
Re: Extracting text from PDFs by holli (Abbot) on Jan 10, 2007 at 10:07 UTC
"With the PDFs I'm looking at now ... not so much."

Be aware that there are a lot of PDFs out there that consist of images only, instead of "text in PDF format". It may be that this is the case with your files, too. To test: fire up Acrobat and see if you can select, copy and paste some text. If not, you cannot extract text without using OCR.

Caveat: copying and pasting from a document may also be impossible if usage of the PDF is restricted.

holli, /regexed monk/
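
The select-and-copy test above can be roughly approximated from a script. What follows is only a sketch under some assumptions: it assumes the pdftotext utility (shipped with Xpdf and Poppler) is installed and on the PATH, and the helper names extract_text and has_text_layer are made up for illustration. It treats a PDF as image-only when pdftotext returns nothing but whitespace, which is a heuristic rather than a definitive check.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Return whatever text pdftotext extracts from $file.
# Assumption: the pdftotext command-line tool is on the PATH.
sub extract_text {
    my ($file) = @_;
    # List-form pipe open bypasses the shell; '-' sends the text to stdout.
    open(my $fh, '-|', 'pdftotext', $file, '-')
        or die "Cannot run pdftotext: $!\n";
    local $/;    # slurp mode: read the whole output at once
    my $text = <$fh>;
    close($fh) or die "pdftotext failed on $file\n";
    return defined $text ? $text : '';
}

# True if the extracted text contains any non-whitespace character.
# Page breaks arrive as form feeds, which \s already covers.
sub has_text_layer {
    my ($text) = @_;
    return $text =~ /\S/ ? 1 : 0;
}

# Report on every file named on the command line.
for my $file (@ARGV) {
    my $verdict = has_text_layer(extract_text($file))
        ? 'has extractable text'
        : 'no text found (possibly image-only; OCR may be needed)';
    print "$file: $verdict\n";
}
```

Run it as "perl checkpdf.pl file1.pdf file2.pdf" (the script name is arbitrary). If pdftotext exits with an error, for instance on an unreadable file, the sketch dies with a message instead of guessing.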