PDF, DOC, etc to HTML or directly to text?

vit has asked for the wisdom of the Perl Monks concerning the following question:

Dear Monks,
I have seen, e.g., this node. However, I want to ask a more general question.
I am working on a web content retrieval project and use LWP::UserAgent and HTTP::Request to fetch the text at a given URL for subsequent processing. Many of the URLs I request link to PDF files, some to DOC files, etc.
Do I need to convert PDFs, DOCs, etc. to HTML first, or is there a module (or modules) that can get the text directly from the HTTP response?
If I do need to convert, which modules should I use?


Re: PDF, DOC, etc to HTML or directly to text? by Anonymous Monk on Jan 01, 2010 at 22:38 UTC
"Do I need to convert PDFs, DOCs, etc first to HTML or there is a module (or modules) which can get the text from the http response?"
Um, the HTTP response is the contents of the file. If the file is a PDF, DOC, image, ... there is no simple text in it, so yes, you need modules or programs to convert each format to text.
"If I do need, then which modules?"
CPAN is full of candidates that you'll have to sort through :) To convert images to text you need OCR software... it's probably easier to simply leverage Google APIs (or Google Desktop?).
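To make the point about per-format conversion concrete, here is a minimal sketch of dispatching on the response's Content-Type header. Everything in it is an assumption for illustration: the names %converter_for, media_type and body_to_text are hypothetical, the HTML handler is a crude tag stripper rather than a real parser, and the PDF and DOC entries are placeholders where a CPAN module or an external program would be plugged in.

```perl
use strict;
use warnings;

# Hypothetical converter table: one handler per media type. The PDF and DOC
# entries are placeholders, not real converters.
my %converter_for = (
    'text/plain' => sub { my ($bytes) = @_; return $bytes },
    'text/html'  => sub {
        my ($bytes) = @_;
        (my $text = $bytes) =~ s/<[^>]*>//g;    # crude tag stripping, sketch only
        return $text;
    },
    'application/pdf'    => sub { die "PDF converter not wired in\n" },
    'application/msword' => sub { die "DOC converter not wired in\n" },
);

# Reduce a header such as "text/html; charset=UTF-8" to its bare,
# lower-cased media type.
sub media_type {
    my ($header) = @_;
    my ($type) = split /;/, ($header // ''), 2;
    $type //= '';
    $type =~ s/^\s+|\s+$//g;
    return lc $type;
}

# Pick a converter by media type; returns nothing for unknown types so the
# caller can decide whether to skip or log the document.
sub body_to_text {
    my ($content_type, $bytes) = @_;
    my $convert = $converter_for{ media_type($content_type) }
        or return;
    return $convert->($bytes);
}

print body_to_text('text/html; charset=UTF-8', '<p>Hello, <b>monks</b></p>'), "\n";
```

With LWP, the two arguments would typically come from the response's Content-Type header and its content; the table then grows by one entry per format you decide to support.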
Re^2: PDF, DOC, etc to HTML or directly to text? by Anonymous Monk on Jan 01, 2010 at 22:46 UTC
KinoSearch Backends for PDF, PPT, DOCX, PPTX..., SWISH::Filter
Re^3: PDF, DOC, etc to HTML or directly to text? by vit (Friar) on Jan 02, 2010 at 17:36 UTC
Thanks!! SWISH::Filter looks good, but how should I process a URL?
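One possible way to go from a URL to text is to fetch the body with LWP::UserAgent and hand it to SWISH::Filter together with the Content-Type reported by the server. The sketch below follows the convert / is_binary / fetch_doc pattern from the SWISH::Filter synopsis as I understand it, and has not been run against a live install. The helper name extract_text is hypothetical, and the individual filters generally rely on external programs (for example pdftotext or catdoc) being installed.

```perl
use strict;
use warnings;

# Convert one fetched body to text using a SWISH::Filter-style object.
# $filter must provide convert(); the document it returns must provide
# is_binary() and fetch_doc().
sub extract_text {
    my ($filter, $content_ref, $content_type, $name) = @_;
    my $doc = $filter->convert(
        document     => $content_ref,     # reference to the raw bytes
        content_type => $content_type,    # e.g. "application/pdf"
        name         => $name,            # label for the document
    );
    return unless $doc;           # empty document or nothing to work with
    return if $doc->is_binary;    # still not text after filtering
    return ${ $doc->fetch_doc };  # fetch_doc returns a scalar reference
}

# Only touch the network when a URL is given on the command line.
if (@ARGV) {
    require LWP::UserAgent;
    require SWISH::Filter;

    my $url = shift @ARGV;
    my $res = LWP::UserAgent->new(timeout => 30)->get($url);
    die "GET $url failed: ", $res->status_line, "\n" unless $res->is_success;

    my $body = $res->content;         # raw bytes of the PDF, DOC, ...
    my $type = $res->content_type;    # media type without parameters
    my $text = extract_text(SWISH::Filter->new, \$body, $type, $url);
    print defined $text ? $text : "No text extracted from $url\n";
}
```

Saved under a name of your choice, the script takes the URL as its only argument. Passing the filter object into extract_text keeps the fetching and the conversion separate, so a different converter can be swapped in later.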