leocharre has asked for the wisdom of the Perl Monks concerning the following question:

We scan a jillion paper documents. It turns out the scanner software (ecopy) can embed OCR info in the PDF as "another layer".

I tried evince and kpdf, and both can read the OCR data (which is pretty good) that resides in a separate "layer" within the PDF document. I can use xpdf's pdftotext to get out the text the OCR put in there, and I can use evince and kpdf to search inside the document. Pretty wild.

At first I thought maybe Image::Exif might get the stuff out. I get a lot of info, but no text (I had guessed the text might be stored as image metadata). How do I access the text inside in some perlish way? That way I could index the entire jillion-document archive. I'm getting confused searching for some PDF module to read the text within. Is there some other way I should do this, since it's kind of a funny thing they do, embedding text under the image?
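Since pdftotext already gets the OCR text out, one perlish route is simply to drive it from Perl. The following is a minimal sketch, not a tested solution for this archive: it assumes pdftotext is on the PATH, and the helper names pdftotext_cmd and pdf_text are made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Build the pdftotext command for one file; '-' sends the text to STDOUT.
sub pdftotext_cmd {
    my ($pdf_path) = @_;
    return ('pdftotext', '-layout', $pdf_path, '-');
}

# Return the embedded (OCR) text of a PDF, or die if pdftotext fails.
sub pdf_text {
    my ($pdf_path) = @_;

    # List-form pipe open bypasses the shell, so spaces and '@' in
    # file names need no quoting.
    open my $fh, '-|', pdftotext_cmd($pdf_path)
        or die "cannot run pdftotext: $!";
    my $text = do { local $/; <$fh> };    # slurp the whole output
    close $fh
        or die "pdftotext failed on $pdf_path (exit status $?)";
    return $text;
}

# Usage: perl pdf_text.pl file1.pdf file2.pdf ...
for my $path (@ARGV) {
    my $text = pdf_text($path);
    printf "%s: %d characters of text\n", $path, length $text;
}
```

The returned string could then be handed to whatever indexer ends up being used for the archive; that part is left out here on purpose.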

Re: read pdf text in hidden layer?
by marto (Cardinal) on May 03, 2007 at 08:23 UTC
    You should take a look at CAM::PDF. I think you will find it useful.

    Thanks

    Martin

      I can't even instantiate an object...

      my $abs = '/var/doc/Towson/AP/IA/1 - VERIZON -@APIA.pdf';
      my $pdf = CAM::PDF->new($abs) or die("CAM PDF returns nothing");

      What's up with this module? No errors, no warnings, nothing? And the documentation is a wreck. It looked so promising. It's really sad; a lot of work went into CAM::PDF, and I hope they revisit the POD.

        Really? Seems to work OK for me. Quick test:

        #!/usr/bin/perl
        use strict;
        use warnings;
        use CAM::PDF;

        my $input  = 'E:\vecguid.pdf';
        my $output = 'E:\Test.pdf';

        my $pdf = CAM::PDF->new($input) or die "$CAM::PDF::errstr\n";
        $pdf->output($output);

        Did you try any of the example scripts?

        Martin
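Going one step past the quick test above, here is a hedged sketch of pulling page text out with CAM::PDF's numPages and getPageText methods, reporting $CAM::PDF::errstr when new() fails. The helper name page_texts is invented for illustration. Whether the OCR layer from this particular scanner comes out readable is an assumption that would need checking against real files.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use CAM::PDF;

# Collect the text of every page. Works on any object that offers
# numPages() and getPageText(), which CAM::PDF does.
sub page_texts {
    my ($pdf) = @_;
    my @texts;
    for my $page (1 .. $pdf->numPages()) {    # pages are numbered from 1
        my $text = $pdf->getPageText($page);
        push @texts, defined $text ? $text : '';
    }
    return @texts;
}

# Usage: perl page_text.pl file1.pdf file2.pdf ...
for my $file (@ARGV) {
    # new() returns undef on failure; the reason is in $CAM::PDF::errstr.
    my $pdf = CAM::PDF->new($file);
    if (!$pdf) {
        warn "skipping $file: $CAM::PDF::errstr\n";
        next;
    }
    my @texts = page_texts($pdf);
    for my $i (0 .. $#texts) {
        printf "--- %s, page %d ---\n%s\n", $file, $i + 1, $texts[$i];
    }
}
```

Warning and moving on, rather than dying, is a choice made here with a large archive in mind: one unreadable file should not stop the whole run.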