PDF Parser

hungrystarfish has asked for the wisdom of the Perl Monks concerning the following question:

Hi there

I'm trying to find a way to extract the body text from PDF documents. The modules on CPAN (specifically PDF) only seem to be able to get the document metadata, not the actual text itself. Is this correct? Has anyone actually used it? Are there any other modules that can do what I want?

Any help you can give this Perl novice would be gratefully received.

Cheers!

Re: PDF Parser by Courage (Parson) on Jul 04, 2002 at 17:17 UTC
Actually, the modules on CPAN that work with PDF can parse deeper than just the document info. PDF GetInfo( has a discussion that may be of interest to you. Also, the answer to your question depends on whether you want to parse the PDF in pure Perl, or perhaps drive Adobe Acrobat via Win32::OLE and send it commands. And finally, two words from my own experience on this: all those streams inside a PDF file can easily be extracted and converted into text via the Compress::Zlib Perl module without problems. I am not sure it is helpful, but it is 100% possible. Courage, the Cowardly Dog.
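To make the stream idea concrete, here is a minimal sketch of inflating the Flate-compressed streams in a PDF. It is written in Python for brevity; in Perl, Compress::Zlib's `uncompress` plays the role of `zlib` here. The function names `inflate_streams` and `shown_strings` are made up for this illustration, and the sketch assumes uncompressed object headers, no encryption, and literal strings drawn with the `Tj` operator. Note that an inflated stream is a list of page-drawing operators rather than clean prose, so real documents (with `TJ` arrays, custom font encodings, or other filters) need considerably more work.

```python
import re
import zlib

# Everything between "stream" + EOL and "endstream". A real parser would
# honour the /Length entry instead of searching for the end marker.
STREAM_RE = re.compile(rb"stream\r?\n(.*?)endstream", re.DOTALL)

# Literal strings shown with the Tj operator, e.g. "(Hello) Tj".
TJ_RE = re.compile(rb"\(((?:[^()\\]|\\.)*)\)\s*Tj")


def inflate_streams(pdf_bytes):
    """Return the inflated bytes of every stream that zlib accepts."""
    inflated = []
    for match in STREAM_RE.finditer(pdf_bytes):
        try:
            # decompressobj ignores the EOL bytes left after the zlib data.
            inflated.append(zlib.decompressobj().decompress(match.group(1)))
        except zlib.error:
            continue  # not Flate data (images, other filters, plain streams)
    return inflated


def shown_strings(content):
    """Pull the literal strings out of one inflated content stream."""
    return [s.decode("latin-1") for s in TJ_RE.findall(content)]


if __name__ == "__main__":
    import sys

    with open(sys.argv[1], "rb") as handle:
        for stream in inflate_streams(handle.read()):
            for text in shown_strings(stream):
                print(text)
```

Run as a script with a PDF path as the only argument, it prints whatever literal strings it can find, one per line.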
Re: PDF Parser by amphiplex (Monk) on Jul 04, 2002 at 16:48 UTC
I use `pdftotext`, but I have never used a Perl module for this. You can get it from www.foolabs.com ---- kurt
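For anyone who wants to call `pdftotext` from a script rather than by hand, a rough sketch follows. It is in Python; the same thing can be done from Perl with backticks or `system`. The helper names `pdftotext_command` and `extract_text` are invented for this example, and it assumes `pdftotext` is installed and on the PATH. Passing `-` as the output file sends the text to standard output, `-layout` tries to preserve the page layout, and `-enc UTF-8` asks for UTF-8 output.

```python
import shutil
import subprocess


def pdftotext_command(pdf_path, txt_path="-", layout=False):
    """Build the argument list; "-" as the output name means stdout."""
    cmd = ["pdftotext", "-enc", "UTF-8"]
    if layout:
        cmd.append("-layout")  # keep the physical layout of the page
    cmd += [pdf_path, txt_path]
    return cmd


def extract_text(pdf_path, layout=False):
    """Run pdftotext and return the extracted text as a string."""
    if shutil.which("pdftotext") is None:
        raise RuntimeError("pdftotext was not found on PATH")
    result = subprocess.run(
        pdftotext_command(pdf_path, "-", layout),
        capture_output=True,
        check=True,  # raise CalledProcessError if pdftotext fails
    )
    return result.stdout.decode("utf-8", errors="replace")
```

Calling `extract_text("report.pdf")` (the file name is only a placeholder) would return the body text, or raise if the tool is missing or cannot read the file.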
Re: PDF Parser by traveler (Parson) on Jul 04, 2002 at 18:25 UTC
I think the tools in PDF::API2 should do it, but I have not tried to write the code. I'd take amphiplex's suggestion and look at the code from xpdf. Then I'd use the PDF::API2 tools to rewrite pdftotext. Then I'd post it to PerlMonks or CPAN. HTH, --traveler
Re: PDF Parser by hungrystarfish (Initiate) on Jul 05, 2002 at 08:25 UTC
Cheers guys, will give xpdf a try. I'll let you know how I get on. JCH