hungrystarfish has asked for the wisdom of the Perl Monks concerning the following question:

Hi there

I'm trying to find a way to extract the body text from PDF documents. The modules on CPAN (specifically PDF) only seem to be able to get the document metadata, not the actual text itself. Is this correct? Has anyone actually used it? Are there any other modules that can do what I want?

Any help you can give this Perl novice would be gratefully received.

Cheers!

Replies are listed 'Best First'.
Re: PDF Parser
by Courage (Parson) on Jul 04, 2002 at 17:17 UTC
    Actually, the modules on CPAN that work with PDF are able to parse deeper than just the document info.

    The node "PDF GetInfo(" has a discussion that may be of interest to you.

    Also, the answer to your question depends on whether you want to parse the PDF in pure Perl, or whether you can drive Adobe Acrobat via Win32::OLE and send commands to it.

    And finally, two words from my own experience on this:
    all the streams inside a PDF file can easily be extracted and converted into text with the Compress::Zlib Perl module. I am not sure it is helpful, but it is 100% possible.
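A minimal sketch of the stream idea, not a tested recipe. It assumes the streams are compressed with plain /FlateDecode (no predictor, no encryption, no other filters) and that a simple regex is good enough to find them. The name inflate_streams is made up for this illustration. Note that what comes out is PDF page-description code, so the visible text still has to be picked out of operators such as (Hello) Tj.

```perl
use strict;
use warnings;
use Compress::Zlib;   # exports inflateInit, Z_OK, Z_STREAM_END

# Hypothetical helper: return the inflated contents of every stream in
# $pdf (the raw bytes of a PDF file) that inflates cleanly.
# Assumption: streams use /FlateDecode only; anything else is skipped.
sub inflate_streams {
    my ($pdf) = @_;
    my @out;

    # Stream data sits between "stream<EOL>" and "endstream".
    while ($pdf =~ /stream\r?\n(.*?)endstream/gs) {
        my $raw = $1;                       # copy: inflate() consumes its input

        my ($z, $init_status) = inflateInit();
        next unless defined $z && $init_status == Z_OK;

        # inflate() stops at the end of the zlib data, so the end-of-line
        # bytes that precede "endstream" are simply left over.
        my ($plain, $status) = $z->inflate($raw);
        push @out, $plain if $status == Z_STREAM_END;
    }
    return @out;
}
```

To try it, read the file in binary mode (binmode on the filehandle, then slurp it into a scalar) and print whatever inflate_streams returns.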

    Courage, the Cowardly Dog.

Re: PDF Parser
by amphiplex (Monk) on Jul 04, 2002 at 16:48 UTC
    I am using pdftotext, but have never used a Perl module for this. You can get it from www.foolabs.com
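If you want to call it from Perl, something like the following should work. This is a sketch under assumptions: pdftotext is installed and on your PATH, and your Perl supports the list form of a piped open (fine on Unix, only in newer Perls on Windows). The sub names pdftotext_command and pdf_to_text are made up for this example. Passing "-" as the output file makes pdftotext write to standard output, and -layout asks it to keep the page layout.

```perl
use strict;
use warnings;

# Hypothetical helper: build the argument list for pdftotext.
# "-" as the output file name means "write the text to stdout".
sub pdftotext_command {
    my ($pdf_path, %opt) = @_;
    my @cmd = ('pdftotext');
    push @cmd, '-layout' if $opt{layout};   # keep the physical layout
    push @cmd, $pdf_path, '-';
    return @cmd;
}

# Hypothetical helper: run pdftotext and return its output as one string.
sub pdf_to_text {
    my ($pdf_path, %opt) = @_;
    my @cmd = pdftotext_command($pdf_path, %opt);

    # List-form pipe open: no shell is involved, so odd file names are safe.
    open(my $fh, '-|', @cmd) or die "Cannot run $cmd[0]: $!";
    local $/;                               # slurp mode
    my $text = <$fh>;
    close($fh) or die "$cmd[0] exited with status $?";
    return $text;
}
```

Usage would be along the lines of my $text = pdf_to_text('report.pdf', layout => 1); where report.pdf is a placeholder file name.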

    ---- kurt
Re: PDF Parser
by traveler (Parson) on Jul 04, 2002 at 18:25 UTC
    I think the tools in PDF::API2 should do it, but I have not tried to write the code. I'd take amphiplex's suggestion and look at the code from xpdf. Then I'd use the PDF::API2 tools to rewrite pdftotext in Perl. Then I'd post it to PerlMonks or CPAN.

    HTH, --traveler

Re: PDF Parser
by hungrystarfish (Initiate) on Jul 05, 2002 at 08:25 UTC
    Cheers guys, will give xpdf a try. I'll let you know how I get on.

    JCH