heezy has asked for the wisdom of the Perl Monks concerning the following question:

Hi

I have searched PerlMonks, CPAN and Google extensively for modules that will let me grab the text out of a PDF document and save it to a file.

I found the following things

But none of these have answered my question! The two PerlMonks posts don't actually point to any resources; they are just rants, flaming and people's opinions. I thought the PDF::API2 module could solve my problem, as it has a "stringify" method, but this just returns the ASCII of the PDF in its raw form! Still encoded and weird!

I need to do this programmatically, as I need to extract the first 400 words from each of 4,500 PDF documents to create an abstract describing each doc. If there were fewer docs I would open each one and copy and paste, but there is no way I am doing that for 4,500 documents!

Thanks people

I hope someone can help!

M

(running on Solaris 9, SPARC etc..)

Re: pdf -> text
by crenz (Priest) on Mar 13, 2003 at 22:01 UTC

    I used pdftotext (part of xpdf, http://www.foolabs.com/xpdf/) for a client's search engine. Yes, you have to spawn a process, but pdftotext is rather fast and works nicely. Since the search engine reindexes the site twice daily, I cache pdftotext's output in a text file and compare its timestamp to that of the PDF file, so most of the time I only have to slurp in the cached text file.
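
    A minimal sketch of that caching approach, not crenz's actual code. It assumes pdftotext is on the PATH and is called as "pdftotext in.pdf out.txt"; the sub names cache_is_stale and pdf_text, and the choice to put the cache file next to the PDF, are made up for illustration.

```perl
use strict;
use warnings;

# True when the cached text file is missing or older than the PDF.
sub cache_is_stale {
    my ($pdf, $txt) = @_;
    return 1 unless -e $txt;
    my $pdf_mtime = (stat $pdf)[9];
    my $txt_mtime = (stat $txt)[9];
    defined $pdf_mtime or die "Cannot stat $pdf: $!\n";
    return $pdf_mtime > $txt_mtime ? 1 : 0;
}

# Return the text of a PDF, spawning pdftotext only when the cache is stale.
# The cache file name (foo.pdf -> foo.txt) is an assumption of this sketch.
sub pdf_text {
    my ($pdf) = @_;
    my $txt = $pdf;
    $txt =~ s/\.pdf\z//i;
    $txt .= '.txt';

    if (cache_is_stale($pdf, $txt)) {
        system('pdftotext', $pdf, $txt) == 0
            or die "pdftotext failed on $pdf (exit status $?)\n";
    }

    open my $fh, '<', $txt or die "Cannot read $txt: $!\n";
    local $/;                  # slurp mode: read the whole file at once
    my $text = <$fh>;
    close $fh;
    return $text;
}
```

    Calling pdf_text('report.pdf') would then run pdftotext the first time and read report.txt on later calls until the PDF changes.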

Re: pdf -> text
by traveler (Parson) on Mar 13, 2003 at 21:39 UTC
    This is not a pure Perl solution, but it might help. You could use pdf2ps, which comes with Ghostscript, to convert to PostScript, then use ps2txt to get the text. Perl should help with the "first 400 words" part.

    HTH, --traveler

    Update: crenz is right. I knew I'd done it simply, but could not find the code. pdftotext is a good solution.
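
    For the "first 400 words" part, something along these lines might do once the text has been extracted. It is only a sketch: it treats a "word" as any run of non-whitespace characters, and the sub name first_words is made up for illustration.

```perl
use strict;
use warnings;

# Return the first $limit whitespace-separated words of $text,
# joined by single spaces. Shorter texts are returned whole.
sub first_words {
    my ($text, $limit) = @_;
    my @words = split ' ', $text;    # ' ' also skips leading whitespace
    $#words = $limit - 1 if @words > $limit;
    return join ' ', @words;
}
```

    With the extracted text in $text, the abstract would be first_words($text, 400). Line breaks and runs of spaces from the PDF layout are collapsed to single spaces along the way.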