leocharre has asked for the wisdom of the Perl Monks concerning the following question:
We scan a jillion paper documents. It turns out the scanner software (ecopy) can embed OCR info in the PDF as "another layer".

I used evince and kpdf, and they can read the OCR data (which is pretty good) that resides in a different "layer" within the PDF document. I can use xpdf's pdftotext to get out the stuff the OCR put in there. I can also use evince and kpdf to search inside the document, which is pretty wild.

At first I thought maybe Image::Exif might get the stuff out. I get a lot of info, but no text. (I thought maybe that was what was going on.)

How do I access the stuff inside in some perlish way? That way I could index the entire jillion-document archive. I'm getting confused searching for some PDF module to read the text within. Is there some other way I should do this, since it's kind of a funny thing they do, embedding text under the image?
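Since pdftotext already gets the OCR text out, one perlish route is simply to wrap it. What follows is only a minimal sketch, assuming pdftotext is on the PATH and that a plain in-memory word index is good enough to start with. The names pdf_text and index_text are made up for illustration and are not from any module.

```perl
use strict;
use warnings;

# Run pdftotext on one PDF and return the extracted text.
# Passing "-" as the output file makes pdftotext write to stdout.
sub pdf_text {
    my ($pdf_path) = @_;
    # List-form pipe open avoids the shell, so odd file names are safe.
    open my $fh, '-|', 'pdftotext', $pdf_path, '-'
        or die "cannot run pdftotext: $!";
    local $/;    # slurp mode: read everything in one go
    my $text = <$fh>;
    close $fh
        or warn "pdftotext failed on $pdf_path (exit status $?)\n";
    return defined $text ? $text : '';
}

# Add the words of one document to an inverted index:
#   lowercased word => { path => number of occurrences }
sub index_text {
    my ($index, $path, $text) = @_;
    for my $word (map { lc $_ } $text =~ /(\w+)/g) {
        $index->{$word}{$path}++;
    }
    return $index;
}

# Hypothetical usage: index every PDF named on the command line.
my %index;
for my $pdf (@ARGV) {
    index_text(\%index, $pdf, pdf_text($pdf));
}
```

Run it as something like `perl index.pl *.pdf` (the script name is arbitrary). For a real archive the hash would want to go into a database or a proper search engine, but the extraction step stays the same.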
Re: read pdf text in hidden layer?
by marto (Cardinal) on May 03, 2007 at 08:23 UTC
by leocharre (Priest) on May 07, 2007 at 15:20 UTC
by marto (Cardinal) on May 07, 2007 at 15:40 UTC
by leocharre (Priest) on May 11, 2007 at 12:31 UTC