colding has asked for the wisdom of the Perl Monks concerning the following question:

Is there a module that can extract the text from a PDF? I've looked into PDF::API2 & it only seems to be concerned with adding text.

Re: Text from PDF
by steves (Curate) on Oct 26, 2004 at 17:16 UTC

    PDF::FDF::Simple claims to be able to extract some subset of text from PDF files to strings, although I have never personally used it. I'd be interested to hear how capable it is for this task if you decide to try it.

Re: Text from PDF
by gellyfish (Monsignor) on Oct 26, 2004 at 16:34 UTC

    You could use ps2ascii, a tool that ships with Ghostscript. Versions are available for both Windows and Unix.
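
    For anyone who wants to drive this from Perl, here is a minimal sketch. It assumes ps2ascii is installed and on your PATH, that you are on a Unix-like system (list-form pipe opens are less portable on older Windows Perls), and that ps2ascii prints to standard output when given only an input file. The sub names ps2ascii_text and nonblank_lines are made up for illustration; they are not part of any module.

```perl
use strict;
use warnings;

# Run ps2ascii (assumed to be on PATH) and return everything it prints.
sub ps2ascii_text {
    my ($file) = @_;
    # List-form pipe open avoids the shell, so odd characters in $file are safe.
    open(my $fh, '-|', 'ps2ascii', $file)
        or die "Cannot run ps2ascii: $!";
    local $/;    # slurp mode: read all output at once
    my $text = <$fh>;
    close($fh) or die "ps2ascii failed (exit status $?)";
    return defined $text ? $text : '';
}

# Split extracted text into trimmed, non-empty lines.
sub nonblank_lines {
    my ($text) = @_;
    my @lines;
    for my $line (split /\r?\n/, $text) {
        $line =~ s/^\s+|\s+$//g;    # strip leading and trailing whitespace
        push @lines, $line if length $line;
    }
    return @lines;
}

# Usage: perl extract.pl some.pdf
if (@ARGV) {
    print "$_\n" for nonblank_lines(ps2ascii_text($ARGV[0]));
}
```

    How clean the output is depends heavily on how the PDF was produced, so expect to do some post-processing.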

    /J\

Re: Text from PDF
by Popcorn Dave (Abbot) on Oct 26, 2004 at 16:28 UTC
    This node may be of help to you. Adobe has an online utility that will turn a PDF to text and you can parse it from there.

    Hope that helps!

    Useless trivia: In the 2004 Las Vegas phone book there are approximately 28 pages of ads for massage, but almost 200 for lawyers.
Re: Text from PDF
by saberworks (Curate) on Oct 26, 2004 at 16:57 UTC
    If you don't need a pure-Perl solution, you can use the Linux utility pdftotext (part of the Xpdf package) to extract all the text, and then use Perl to parse the result.
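
    A rough sketch of that approach, assuming pdftotext is installed and on your PATH and that you are on a Unix-like system. Passing '-' as the output file asks pdftotext to write to standard output, and -layout asks it to try to preserve the physical page layout. The sub names pdftotext_command and pdf_to_text are hypothetical helpers, not an existing module's API.

```perl
use strict;
use warnings;

# Build the argument list for pdftotext.
# '-' as the output file name sends the text to standard output.
sub pdftotext_command {
    my ($pdf, %opt) = @_;
    my @cmd = ('pdftotext');
    push @cmd, '-layout' if $opt{layout};    # try to keep the page layout
    push @cmd, $pdf, '-';
    return @cmd;
}

# Run pdftotext and return the extracted text as one string.
sub pdf_to_text {
    my ($pdf, %opt) = @_;
    my @cmd = pdftotext_command($pdf, %opt);
    # List-form pipe open: no shell involved, so no quoting problems.
    open(my $fh, '-|', @cmd) or die "Cannot run $cmd[0]: $!";
    local $/;    # slurp mode
    my $text = <$fh>;
    close($fh) or die "$cmd[0] failed (exit status $?)";
    return defined $text ? $text : '';
}

# Usage: perl pdf2txt.pl some.pdf
if (@ARGV) {
    my $text  = pdf_to_text($ARGV[0], layout => 1);
    my @words = split ' ', $text;
    printf "%d words extracted\n", scalar @words;
}
```

    From there the text is an ordinary Perl string, so regexes, split and friends all apply.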
Re: Text from PDF
by punch_card_don (Curate) on Oct 26, 2004 at 17:36 UTC
    IF, and this is a big 'if', you only have a limited number of documents and the real objective is just getting the text, for example for indexing (as opposed to developing a text extraction tool for long-term use), then a very low-tech solution might suffice:

    • open the pdf in Acrobat (not Acrobat Reader, Acrobat)
    • under 'View' select 'Continuous'
    • under 'Edit' click 'Select All'
    • copy & paste
    This takes about 20 seconds per document, so the math on time investment is easily done. I once used it instead of developing a module to extract PDF text, when I needed to index PDF files in an index-based search engine before direct PDF indexing was commonplace.
Re: Text from PDF
by dragonchild (Archbishop) on Oct 27, 2004 at 13:04 UTC
    PDF::Extract seems to be where you want to look.

    Being right, does not endow the right to be rude; politeness costs nothing.
    Being unknowing, is not the same as being stupid.
    Expressing a contrary opinion, whether to the individual or the group, is more often a sign of deeper thought than of cantankerous belligerence.
    Do not mistake your goals as the only goals; your opinion as the only opinion; your confidence as correctness. Saying you know better is not the same as explaining you know better.

      The docs for PDF::Extract seem to indicate that it just pulls pages out of an existing PDF document and creates new PDF documents out of these subsets.

Re: Text from PDF
by Anonymous Monk on Oct 26, 2004 at 21:56 UTC
    Thanks for all the replies. For the record: (1) I've looked at other utilities, but wanted a Perl solution; (2) I saw the FDF module, but I'm not sure I want to bring it in since I've never heard of FDF; (3) I want a script to do what Adobe does (badly) when it saves a PDF as text. Also for the record: I give up. I'm gonna buy somethin'.
Re: Text from PDF
by steves (Curate) on Oct 27, 2004 at 10:08 UTC

    I played around with PDF::FDF::Simple and I couldn't get it to extract text from PDF files. I thought that FDF was just a subset of PDF but there must be more to it than that. Then I looked around for free PDF-to-text tools and was surprised to find that there aren't many that are truly free. Ghostscript may be your best free option. It apparently has a tool for getting text from PDF documents. Another one I found is a Java tool named PDFBox.