Cody Pendant has asked for the wisdom of the Perl Monks concerning the following question:

I've had some success in the past extracting PDFs to HTML using the "pdftohtml" command-line utility.

With the PDFs I'm looking at now ... not so much.

What can I do to get started parsing PDFs with Perl and extracting the text from them in any usable way? The whole PDF::* hierarchy on CPAN is a bit of a mystery to me.



($_='kkvvttuu bbooppuuiiffss qqffssmm iibbddllffss')
=~y~b-v~a-z~s; print

Re: Extracting text from PDFs
by GrandFather (Saint) on Jan 10, 2007 at 00:25 UTC
Re: Extracting text from PDFs
by holli (Abbot) on Jan 10, 2007 at 10:07 UTC
    With the PDFs I'm looking at now ... not so much.
    Be aware that there are a lot of PDFs out there that consist of images only, instead of "text in PDF format". That may be the case with your files, too. To test: fire up Acrobat and see whether you can select, copy & paste some text. If you can't, you cannot extract text without using OCR.

    Caveat: copying & pasting from a document may also be impossible if usage of the PDF is restricted.


    holli, /regexed monk/
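
For what it's worth, holli's copy & paste test can be roughly approximated from a script. The sketch below is not from the thread: it assumes the pdftotext command-line tool (shipped with Xpdf and Poppler, alongside pdftohtml) is installed and on your PATH, and the helper names has_usable_text and extract_text, as well as the five-word threshold, are made up for illustration. It only checks whether the extracted output contains a handful of word-like runs of letters.

```perl
use strict;
use warnings;

# Heuristic (an assumption, not a standard): call the output "usable" if it
# contains at least $min_words runs of two or more letters.
sub has_usable_text {
    my ($text, $min_words) = @_;
    $min_words = 5 unless defined $min_words;
    return 0 unless defined $text;
    my $count = () = $text =~ /\p{L}{2,}/g;   # count all matches
    return $count >= $min_words ? 1 : 0;
}

# Run pdftotext and return everything it prints.
# '-layout' tries to keep the page layout; a final '-' sends text to STDOUT.
sub extract_text {
    my ($file) = @_;
    open my $fh, '-|', 'pdftotext', '-layout', $file, '-'
        or return;                            # tool missing or not runnable
    binmode $fh, ':encoding(UTF-8)';          # pdftotext emits UTF-8 by default
    local $/;                                 # slurp mode
    my $text = <$fh>;
    close $fh;
    return $text;
}

# Usage: perl check_pdf.pl some.pdf
if (@ARGV) {
    my $text    = extract_text($ARGV[0]);
    my $verdict = has_usable_text($text)
        ? "Text layer found; extraction should work.\n"
        : "Little or no text; the PDF may be image-only or restricted.\n";
    print $verdict;
}
```

If the script reports little or no text for a file that clearly shows words on screen, that points at the image-only or restricted cases described above, and OCR is probably the only way forward.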