how to extract text from PDF

arunmep has asked for the wisdom of the Perl Monks concerning the following question:

Re: how to extract text from PDF by marto (Cardinal) on Sep 14, 2005 at 11:17 UTC
Hi, do a Super Search on this topic; it has been covered before. pdftotext is a non-Perl way to extract text from PDF files, and you could call it from a Perl script. Also have a look at the modules on CPAN and see if any of them fit your requirements. Martin
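For what it's worth, calling pdftotext from a Perl script could look roughly like the sketch below. It assumes pdftotext is installed and on your PATH, and that your build accepts the -layout option and "-" as the output file (meaning stdout). The sub names pdftotext_command and extract_pdf_text and the file name sample.pdf are made up for illustration, and the list-form pipe open is assumed to be available on your platform.

```perl
use strict;
use warnings;

# Build the argument list for pdftotext.
# '-layout' asks it to keep the physical page layout;
# '-' as the output file sends the text to stdout.
sub pdftotext_command {
    my ($pdf_path) = @_;
    return ('pdftotext', '-layout', $pdf_path, '-');
}

# Run pdftotext and return the extracted text; dies on failure.
sub extract_pdf_text {
    my ($pdf_path) = @_;
    my @cmd = pdftotext_command($pdf_path);

    # List-form pipe open bypasses the shell, so spaces or quotes
    # in the file name need no escaping.
    open(my $fh, '-|', @cmd) or die "Cannot run $cmd[0]: $!";
    local $/;                 # slurp mode: read everything at once
    my $text = <$fh>;
    close($fh) or die "$cmd[0] exited with status $?";
    return $text;
}

# Usage (sample.pdf is a hypothetical file):
# print extract_pdf_text('sample.pdf');
```

Keeping the command as a list rather than one interpolated string is deliberate: nothing goes through the shell, which matters if the PDF names come from a source you don't control.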
Re: how to extract text from PDF by tbone1 (Monsignor) on Sep 14, 2005 at 12:51 UTC
With a little Super Searching, you could have found that pdftotext is part of the xpdf package. I've used it since late 2002, and the only problems I've had arose from one particular organization (a government agency, go figure) making changes that were not obvious and doing odd, possibly nonstandard things with their formatting. pdftotext works well, but you have to watch the source of your data, particularly if that source isn't trustworthy. Although, come to think of it, that's true in all areas of my job. -- tbone1, YAPS (Yet Another Perl Schlub) And remember, if he succeeds, so what. - Chick McGee
Re: how to extract text from PDF by newroz (Monk) on Sep 14, 2005 at 11:24 UTC
Hi, if the purpose is to develop a search engine, use ht-Dig; it can be set up to index PDFs. An alternative is swish-e. But how to do this with Perl? I don't know an efficient way.
Re: how to extract text from PDF by blazar (Canon) on Sep 14, 2005 at 11:55 UTC
Nothing that I've tried myself, but the general hint is: search CPAN for something suitable.