vit has asked for the wisdom of the Perl Monks concerning the following question:

Dear Monks,
I did see, e.g., this node. However, I want to ask a more general question.
I am working on a web content retrieval project and use LWP::UserAgent and HTTP::Request to fetch the text at a given URL for subsequent processing. Many of the URLs I request are links to PDF files, some are DOCs, etc.
Do I need to convert PDFs, DOCs, etc. to HTML first, or is there a module (or modules) that can get the text directly from the HTTP response?
If I do need to convert, which modules should I use?

Replies are listed 'Best First'.
Re: PDF, DOC, etc to HTML or directly to text?
by Anonymous Monk on Jan 01, 2010 at 22:38 UTC
    Do I need to convert PDFs, DOCs, etc first to HTML or there is a module (or modules) which can get the text from the http response?

    Um, the HTTP response is the content of the file. If the file is a PDF, DOC, image, etc., it contains no plain text, so yes, you need modules or programs to convert each format to text.
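
    To make that concrete, here is a rough sketch of dispatching on the Content-Type of the response. The converter names ('pdf_to_text', 'doc_to_text', 'strip_html', 'ocr') and the subs pick_converter and converter_for_url are made-up placeholders, not real modules; you would plug in whichever CPAN module or external program you settle on for each format. It also assumes the server sends a sensible Content-Type header, which is not always the case.

```perl
use strict;
use warnings;

# Hypothetical mapping from MIME type to the kind of converter needed.
# The values are placeholder labels, not real module names.
my %CONVERTER_FOR = (
    'text/plain'         => 'none',
    'text/html'          => 'strip_html',
    'application/pdf'    => 'pdf_to_text',
    'application/msword' => 'doc_to_text',
);

# Decide which converter a response needs, based on its Content-Type.
sub pick_converter {
    my ($content_type) = @_;

    # Drop parameters such as "; charset=utf-8" and normalise case.
    my ($mime) = split /;/, lc( $content_type // '' );
    $mime = '' unless defined $mime;
    $mime =~ s/^\s+|\s+$//g;

    return 'ocr' if $mime =~ m{^image/};    # images have no text layer
    return $CONVERTER_FOR{$mime} // 'unknown';
}

# Fetch a URL and report which converter its body needs.
# LWP::UserAgent is loaded lazily so pick_converter can be used without it.
sub converter_for_url {
    my ($url) = @_;
    require LWP::UserAgent;

    my $response = LWP::UserAgent->new->get($url);
    die $response->status_line unless $response->is_success;

    # content() is the raw bytes of the file; content_type() is its MIME type.
    return ( pick_converter( $response->content_type ), $response->content );
}
```

    You would call converter_for_url($url) and hand the returned bytes to the matching converter; anything that comes back as 'unknown' probably deserves a log entry rather than a guess.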

    If I do need, then which modules?

    CPAN is full of candidates you'll have to sort through :) To convert images to text you need OCR software... it's probably easier to simply leverage Google APIs (or Google Desktop?).

        Thanks!! SWISH::Filter looks good, but how should I process a URL?