in reply to Extracting content text from PDFs
It's pretty well tested; I use it a lot at work. If other people used it, I could get technical feedback to make it better.
I also have an indexer that records all text content, and an interface to search it. So you can have a million docs scanned in and search their text content; it then tells you the file location, the page, and the line number. That part is a little more complex, because indexing has to be done in parallel across multiple CPUs; otherwise it would take 30 days for 60k docs.
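To give a rough idea of what such an indexer does, here is a minimal sketch in Python. It is not the actual indexer, only an illustration under some assumptions: page text is taken to be already extracted (for example by OCR), and the names `index_document`, `build_index`, `merge`, and `search` are hypothetical. Each document is indexed in its own worker process, and the per-document hits are merged into one word lookup that returns file, page, and line number. For scale, 30 days over 60k docs works out to roughly 43 seconds per document on a single CPU, which is why spreading the work over several processes matters.

```python
from collections import defaultdict
from concurrent.futures import ProcessPoolExecutor


def index_document(doc):
    """Return (word, path, page, line) hits for one document.

    `doc` is a (path, pages) pair, where `pages` is a list of page texts
    that have already been extracted (assumed here, e.g. by OCR).
    """
    path, pages = doc
    hits = []
    for page_no, page_text in enumerate(pages, start=1):
        for line_no, line in enumerate(page_text.splitlines(), start=1):
            # Naive whitespace tokenizing; each word is recorded once per line.
            for word in set(line.lower().split()):
                hits.append((word, path, page_no, line_no))
    return hits


def merge(per_doc_hits):
    """Combine per-document hits into {word: [(path, page, line), ...]}."""
    index = defaultdict(list)
    for hits in per_doc_hits:
        for word, path, page_no, line_no in hits:
            index[word].append((path, page_no, line_no))
    return dict(index)


def build_index(docs, workers=None):
    """Index all documents, one document per task.

    workers=1 runs serially; otherwise a process pool is used, and
    workers=None lets the pool choose its size from the available CPUs.
    """
    if workers == 1:
        return merge(map(index_document, docs))
    with ProcessPoolExecutor(max_workers=workers) as pool:
        return merge(pool.map(index_document, docs))


def search(index, word):
    """Return sorted (path, page, line) locations for a single word."""
    return sorted(index.get(word.lower(), []))


if __name__ == "__main__":
    # Made-up sample data, for illustration only.
    sample = [
        ("a.pdf", ["hello world\nsecond line", "page two hello"]),
        ("b.pdf", ["nothing here\nHello again"]),
    ]
    for path, page_no, line_no in search(build_index(sample), "hello"):
        print(path, "page", page_no, "line", line_no)
```

A real indexer would also need to store the index on disk and handle documents whose extraction fails; the sketch keeps everything in memory to stay short.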
Update: Make sure to see the README; there are other notes and tesseract install help in there. I suggest you check out the packages individually instead of using cpan.
The whole thing works like a marvel. Take a look at the INSTALL help file; if you need help, just email me per the instructions.