arunmep has asked for the wisdom of the Perl Monks concerning the following question:

Hi everybody, I am developing a search engine and I need to extract the text from PDF files using Perl. Can anybody suggest a way? Thank you.

Edit: g0n - corrected spelling in title of 'how'

Replies are listed 'Best First'.
Re: how to extract text from PDF
by marto (Cardinal) on Sep 14, 2005 at 11:17 UTC
    Hi,

    Do a Super Search on this topic; it has been covered before.
    pdftotext is a non-Perl way to extract text from PDF files, which you could call from a Perl script.
    Have a look at the modules on CPAN and see if any of them fit your requirements.

    Martin
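
For anyone landing here later, the "call pdftotext from a Perl script" suggestion might look roughly like the sketch below. It assumes the pdftotext binary (from xpdf) is installed and on the PATH; the helper name pdf_to_text and its optional second argument are made up for illustration and are not part of any module.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the text of a PDF by running pdftotext.
# Assumes the pdftotext binary is installed and on the PATH.
sub pdf_to_text {
    my ($pdf_path, $cmd) = @_;
    $cmd //= 'pdftotext';    # overridable, mainly so the helper can be tested

    # List-form pipe open: no shell is involved, so odd characters in the
    # file name are passed through as-is rather than interpreted.
    # The trailing '-' asks pdftotext to write the text to standard output;
    # -layout asks it to keep the original physical layout where it can.
    open(my $fh, '-|', $cmd, '-layout', $pdf_path, '-')
        or die "Cannot run $cmd: $!";
    my $text = do { local $/; <$fh> };    # slurp all output at once
    close($fh)
        or die "$cmd failed on $pdf_path (exit status " . ($? >> 8) . ")";
    return $text;
}

# Usage: perl extract.pl file.pdf
if (@ARGV) {
    print pdf_to_text($ARGV[0]);
}
```

The list form of open is a deliberate choice here: a search engine will usually be fed file names it did not choose, and skipping the shell avoids quoting problems with them. The returned string can then be handed to whatever indexer you use.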
Re: how to extract text from PDF
by tbone1 (Monsignor) on Sep 14, 2005 at 12:51 UTC
    With a little supersearching, you could have found that pdftotext is part of the xpdf package.

    I've used it since late 2002, and the only problems I've had arise from one particular organization (a government agency, go figure) making changes that were not obvious, and doing odd, possibly nonstandard things with their formatting.

    pdftotext works well, but you have to watch your source of the data, particularly if that source isn't trustworthy. Although, come to think of it, that's true in all areas of my job.

    --
    tbone1, YAPS (Yet Another Perl Schlub)
    And remember, if he succeeds, so what.
    - Chick McGee

Re: how to extract text from PDF
by newroz (Monk) on Sep 14, 2005 at 11:24 UTC
    Hi. If the purpose is to develop a search engine, use
    ht-Dig.
    Here is how to index PDFs.
    As an alternative, there is swish-e.
    But how to do this with Perl? I don't know an efficient way.
Re: how to extract text from PDF
by blazar (Canon) on Sep 14, 2005 at 11:55 UTC
    Nothing that I've tried myself, but the general hint is: search CPAN for something suitable - search results here.