We scan a jillion paper documents. It turns out the scanner software (ecopy) can embed the OCR info in the PDF as "another layer".
I used evince and kpdf, and they can read the OCR data (it's pretty good), which resides in a different "layer" within the PDF document. I can use xpdf's pdftotext to get out the stuff the OCR put in there, and I can use evince and kpdf to search inside the document. Pretty wild.

At first I thought maybe Image::Exif might get the stuff out (I thought maybe that was what was going on). I get a lot of info from it, but no text. How do I access the text inside in some perlish way? That way I could index the entire jillion-document archive. I'm getting confused searching for a PDF module that reads the text within. Is there some other way I should do this, since embedding text under the image is kind of a funny thing to do?
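For what it's worth, here is a rough sketch of the fallback I have in mind if no module turns up: shell out to pdftotext from Perl and build a simple word index. It assumes pdftotext is on the PATH, and the sub names (pdf_text, add_to_index) are made up for illustration, not taken from any module.

```perl
use strict;
use warnings;

# Run pdftotext on one file and return the embedded text layer as a string.
# Assumption: xpdf's pdftotext is installed; the trailing '-' sends the
# text to stdout and -q suppresses its messages.
sub pdf_text {
    my ($pdf) = @_;
    open my $fh, '-|', 'pdftotext', '-q', $pdf, '-'
        or die "cannot run pdftotext: $!";
    local $/;                      # slurp the whole output at once
    my $text = <$fh>;
    close $fh or warn "pdftotext failed on $pdf (exit status $?)\n";
    return defined $text ? $text : '';
}

# Add every word of $text to an inverted index: word => { file => count }.
sub add_to_index {
    my ($index, $file, $text) = @_;
    for my $word (lc($text) =~ /(\w+)/g) {
        $index->{$word}{$file}++;
    }
    return $index;
}

# Index whatever PDFs are named on the command line.
my %index;
for my $pdf (@ARGV) {
    add_to_index(\%index, $pdf, pdf_text($pdf));
}
printf "%d distinct words indexed\n", scalar keys %index if @ARGV;
```

That works file by file, but it forks a process per document, which is why I'd still prefer something that reads the text from inside Perl.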
In reply to read pdf text in hidden layer? by leocharre