Re: Extracting content text from PDFs

Funny, I was just updating PDF::OCR. Let me update the packages first: I have PDF::GetImages and Image::OCR::Tesseract to update, then PDF::OCR.

It's pretty well tested; I use it a lot at work. If other people used it, I could get technical feedback to make it better.

I also have an indexer that records all text content, plus an interface to search it. So you can have a million docs scanned in and search their text content; it then tells you the file location, the page, and the line number. That part is a little more complex, because indexing has to be done in parallel on multiple CPUs. Otherwise it would take 30 days for 60k docs.
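
To give a rough idea of the indexing part (30 days over 60k docs works out to roughly 43 seconds per document on a single CPU), here is a minimal sketch in Python. It is not the actual indexer. It assumes the text has already been extracted, so each document is just a list of page strings, and the names index_document, build_index, and search are made up for illustration.

```python
from collections import defaultdict
from concurrent.futures import ProcessPoolExecutor
import re

# Hypothetical tokenizer: lowercase runs of letters and digits.
WORD = re.compile(r"[a-z0-9]+")


def index_document(item):
    """Return (word, path, page, line) postings for one document.

    `item` is a (path, pages) pair, where `pages` is a list of strings,
    one per page of already-extracted text (an assumption of this sketch).
    """
    path, pages = item
    postings = []
    for page_no, page_text in enumerate(pages, start=1):
        for line_no, line in enumerate(page_text.splitlines(), start=1):
            # set() so a word repeated on one line is recorded once.
            for word in set(WORD.findall(line.lower())):
                postings.append((word, path, page_no, line_no))
    return postings


def build_index(docs, workers=4, executor_cls=ProcessPoolExecutor):
    """Index documents in parallel and merge the results in the parent.

    `docs` maps file path -> list of page texts. Each worker handles
    whole documents; only the merge step runs in a single process.
    """
    index = defaultdict(list)
    with executor_cls(max_workers=workers) as pool:
        for postings in pool.map(index_document, docs.items()):
            for word, path, page, line in postings:
                index[word].append((path, page, line))
    return index


def search(index, word):
    """Return sorted (path, page, line) locations where `word` occurs."""
    return sorted(index.get(word.lower(), []))
```

Splitting the work per document keeps the workers independent, so adding CPUs should cut the wall-clock time roughly in proportion, as long as the OCR step (not shown here) is the slow part.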

Update:

PDF-GetImages-1.07

Image-OCR-Tesseract-1.10

PDF-OCR-1.06

Make sure to read the README; it has other notes and help with installing tesseract. I suggest you check out the packages individually instead of using cpan.
The whole thing works like a charm. Take a look at the INSTALL help file; if you need help, just email me per the instructions.
