marto -
Thanks for your extremely helpful post, and apologies for not responding to it sooner. My experience was exactly what clinton describes in the thread you reference: modules like CAM::PDF produce only mildly helpful output. I am very grateful for the pointer to the Linux tool pdftotext. With the -htmlmeta option it produces extremely useful, tagged output from a given PDF, which is precisely what I have been looking for for a long time. I will concentrate my efforts on this utility from now on.
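For anyone who wants to pick that tagged output apart programmatically, here is a minimal sketch in Python. It assumes the poppler build of pdftotext, where -htmlmeta wraps the document info in `<title>` and `<meta name="..." content="...">` tags and the page text in a `<pre>` block. The names `MetaCollector`, `parse_htmlmeta` and `pdf_to_meta_and_text` are made up for illustration.

```python
import subprocess
from html.parser import HTMLParser


class MetaCollector(HTMLParser):
    """Collect the <title>, the <meta name/content> pairs, and the <pre> text."""

    def __init__(self):
        super().__init__()
        self.meta = {}
        self.text_parts = []
        self._capturing = None  # tag whose text is currently being captured

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and attrs.get("name"):
            self.meta[attrs["name"]] = attrs.get("content") or ""
        elif tag in ("title", "pre"):
            self._capturing = tag

    def handle_endtag(self, tag):
        if tag == self._capturing:
            self._capturing = None

    def handle_data(self, data):
        if self._capturing == "title":
            self.meta["Title"] = self.meta.get("Title", "") + data
        elif self._capturing == "pre":
            self.text_parts.append(data)


def parse_htmlmeta(html):
    """Return (metadata dict, body text) from pdftotext -htmlmeta output."""
    collector = MetaCollector()
    collector.feed(html)
    collector.close()
    return collector.meta, "".join(collector.text_parts)


def pdf_to_meta_and_text(pdf_path):
    """Run pdftotext on one file; '-' as the output name means stdout."""
    result = subprocess.run(
        ["pdftotext", "-htmlmeta", pdf_path, "-"],
        check=True,
        capture_output=True,
    )
    # Assumption: pdftotext is left at its default UTF-8 output encoding.
    return parse_htmlmeta(result.stdout.decode("utf-8", errors="replace"))
```

Calling `pdf_to_meta_and_text("some.pdf")` (a placeholder file name) should give back a dictionary with keys such as Title or Author, where the PDF carries them, plus the plain text. The parsing is kept separate from the subprocess call so it can be tried on saved output without pdftotext installed.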
Thanks again!
Pat
Funny.. I was just updating PDF::OCR.
Let me update the package first. I have PDF::GetImage and Image::OCR::Tesseract to update, then PDF::OCR.
It's pretty well tested; I use it a lot at work. If other people were to use it, I could get technical feedback to make it better.
I also have an indexer that records all text content, and an interface to search it. So you can have a million docs scanned in and search their text content, and it tells you the file location, the page, and the line number. That part is a little more complex, because indexing has to be done in parallel on multiple CPUs; otherwise it would take 30 days for 60k docs.
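To make the idea concrete, here is a rough Python sketch of that kind of parallel indexer. It is not the indexer described above. It swaps the OCR step for plain pdftotext extraction, relies on pdftotext separating pages with form feed characters, and uses made-up names (`index_text`, `extract_and_index`, `build_index`). For scanned documents, the extraction step would have to be replaced by an OCR call.

```python
import re
import subprocess
from collections import defaultdict
from concurrent.futures import ProcessPoolExecutor

WORD = re.compile(r"\w+")


def index_text(path, text):
    """Return (word, path, page, line) postings; pages are split on form feeds."""
    postings = []
    for page_no, page in enumerate(text.split("\f"), start=1):
        for line_no, line in enumerate(page.split("\n"), start=1):
            for word in WORD.findall(line.lower()):
                postings.append((word, path, page_no, line_no))
    return postings


def extract_and_index(path):
    """Worker: extract one document's text in its own process, then index it."""
    out = subprocess.run(
        ["pdftotext", path, "-"], check=True, capture_output=True
    )
    return index_text(path, out.stdout.decode("utf-8", errors="replace"))


def build_index(paths, workers=None):
    """Spread documents over worker processes and merge their postings."""
    index = defaultdict(list)
    # workers=None lets the pool choose a size based on the available CPUs.
    with ProcessPoolExecutor(max_workers=workers) as pool:
        for postings in pool.map(extract_and_index, paths):
            for word, path, page, line in postings:
                index[word].append((path, page, line))
    return index
```

Looking up `index["invoice"]` would then give a list of (file, page, line) hits. For scale, 30 days for 60k docs works out to roughly 43 seconds per document on a single CPU, which is why spreading the work over several processes matters here.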
Update:
Make sure to see the README; it has other notes and Tesseract installation help. I suggest you check out the packages individually instead of using CPAN.
The whole thing works like a marvel. Take a look at the INSTALL help file; if you need help, just email me per the instructions.
PDF::API2 has a nice little hash with the document info. That makes it easy to put the info into a database or use it otherwise. I've used it with great success to get info similar to what you are planning.
HTH, --traveler
Hi, traveler -
Thanks for your suggestion. I have tried the module you suggest, but unfortunately to no avail. Apart from the fact that it extracted only a fraction of the relevant document information, its main drawback was that the stringify method produced nothing but a load of gibberish that flickered across my screen with plenty of beeps. Any idea why this is?
I also wonder whether this module depends on how a PDF was generated. Can it only handle PDFs that were generated by a certain application or with certain parameters?
Thanks for your help nonetheless and cheers from Hamburg -
Pat
If there are limits on which PDFs work and which don't, I have not run into them :)
I have not seen stringify send garbage to the output unless I tried to display a picture. For real text, it seemed to work just fine. I have no idea about those problems as it has worked for the uses to which I have put it.
sorry
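One possible explanation for the beeps, offered as an assumption rather than a diagnosis: image data and compressed streams are binary, and binary bytes sent straight to a terminal include control characters such as BEL (0x07), which rings the terminal bell. A small Python sketch of a safe preview, with the hypothetical helper name `printable_preview`, might look like this:

```python
def printable_preview(data, limit=200):
    """Show the first `limit` bytes, replacing non-printable bytes with '.'.

    Printable ASCII, tab and newline are kept; everything else, including
    the BEL byte that makes a terminal beep, becomes a dot.
    """
    shown = data[:limit]
    return "".join(
        chr(b) if (32 <= b < 127 or b in (9, 10)) else "."
        for b in shown
    )
```

Running the stringified output through a filter like this before printing would show whether it is real text or binary data, without the flicker and noise.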