read pdf text in hidden layer?

leocharre has asked for the wisdom of the Perl Monks concerning the following question:

We scan a jillion paper documents. It turns out the scanner software (eCopy) can embed OCR info in the PDF as "another layer".

I used evince and kpdf, and they can read the OCR data (it's pretty good), which resides in a different "layer" within the PDF document. I can use xpdf's pdftotext to get out the text the OCR put in there. I can also use evince and kpdf to search inside the document, pretty wild.

At first I thought maybe Image::Exif might get the text out. I get a lot of info, but no text. (I thought maybe this is what was up.) How do I access the text inside in some perlish way? That way I could index the entire jillion-document archive. I'm getting confused searching for a PDF module that reads the text within. Is there some other way I should do this, since embedding text under the image is kind of a funny thing to do?

Re: read pdf text in hidden layer? by marto (Cardinal) on May 03, 2007 at 08:23 UTC
You should take a look at CAM::PDF. I think you will find it useful.

Thanks
Martin
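For what it's worth, here is a minimal sketch of how page text might be pulled out with CAM::PDF. It relies on the module's `new`, `numPages` and `getPageText` methods; the helper name `pdf_page_texts` and the command-line handling are made up for illustration. Whether `getPageText` recovers the OCR layer written by this particular scanner software has not been checked here, so treat it as a starting point rather than a confirmed answer.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use CAM::PDF;

# Return a list holding the extracted text of each page, in page order.
# (pdf_page_texts is a hypothetical helper name, not part of CAM::PDF.)
sub pdf_page_texts {
    my ($path) = @_;

    my $pdf = CAM::PDF->new($path)
        or die "Cannot parse '$path': $CAM::PDF::errstr\n";

    my @texts;
    for my $page (1 .. $pdf->numPages()) {
        my $text = $pdf->getPageText($page);
        # Store an empty string for pages that yield no text,
        # so list positions still match page numbers.
        push @texts, defined $text ? $text : '';
    }
    return @texts;
}

# Print the text of the file named on the command line, if one was given.
if (@ARGV) {
    my $n = 0;
    for my $text (pdf_page_texts($ARGV[0])) {
        $n++;
        print "--- page $n ---\n$text\n";
    }
}
```

Run it as `perl script.pl some.pdf`; the per-page strings could then be fed to whatever indexer you prefer.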
Re^2: read pdf text in hidden layer? by leocharre (Priest) on May 07, 2007 at 15:20 UTC
I can't even instantiate an object...

    my $abs = '/var/doc/Towson/AP/IA/1 - VERIZON -@APIA.pdf';
    my $pdf = CAM::PDF->new($abs) or die("CAM PDF returns nothing");

What's up with this module? No errors, no warnings, nothing? And the documentation is a wreck. It looked so promising. It's really sad; a lot of work went into CAM::PDF, and I hope they revisit the POD.
Re^3: read pdf text in hidden layer? by marto (Cardinal) on May 07, 2007 at 15:40 UTC
Really? Seems to work OK for me. Quick test:

    #!/usr/bin/perl
    use strict;
    use warnings;
    use CAM::PDF;

    my $input  = 'E:\vecguid.pdf';
    my $output = 'E:\Test.pdf';

    # Report CAM::PDF's own error message if the file cannot be parsed.
    my $pdf = CAM::PDF->new($input) or die "$CAM::PDF::errstr\n";
    $pdf->output($output);

Did you try any of the example scripts?

Martin
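If CAM::PDF keeps refusing a particular file, one fallback for the indexing goal is to wrap pdftotext, which the original post says already gets the OCR text out. The following is only a sketch: it assumes pdftotext is on the PATH, and the names `pdftotext_cmd` and `extract_text` are invented for this example.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Build the argument list for pdftotext; a text-file argument of '-'
# makes pdftotext write the extracted text to standard output.
sub pdftotext_cmd {
    my ($path) = @_;
    return ('pdftotext', $path, '-');
}

# Run pdftotext on one file and return everything it prints.
sub extract_text {
    my ($path) = @_;
    my @cmd = pdftotext_cmd($path);

    # The list form of a pipe open bypasses the shell, so spaces and
    # '@' characters in the file name need no extra quoting.
    open(my $fh, '-|', @cmd)
        or die "Cannot run pdftotext: $!\n";

    local $/;    # slurp the whole output in one read
    my $text = <$fh>;

    close($fh)
        or die "pdftotext failed on '$path' (wait status $?)\n";

    return defined $text ? $text : '';
}

# Print the text of the file named on the command line, if one was given.
print extract_text($ARGV[0]) if @ARGV;
```

Keeping the command construction in its own function makes it easy to add options later without touching the pipe handling.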
Re^4: read pdf text in hidden layer? by leocharre (Priest) on May 11, 2007 at 12:31 UTC