
I didn't see anything in the PDF::OCR2 documentation that allowed you to just scan a portion of the document. How would I do this?

$PM = "Perl Monk's";
$MCF = "Most Clueless Friar Abbot Bishop Pontiff Deacon";
$nysus = $PM . $MCF;
Click here if you love Perl Monks

Re^3: PDF::OCR2 results not what I was hoping for
by Corion (Patriarch) on Feb 08, 2016 at 17:02 UTC

    Reading the documentation of PDF::OCR2, I get the impression that it converts the PDF pages into separate image files using PDF::GetImages and then uses Image::OCR::Tesseract to get the text from the image.

    I would change that to add a cropping step in between, which selects only the "interesting" part of the image.

      Bam! Got it. I set the "density" attribute to "300x300" before reading the image in; by default it is 72 dpi.

      PDF::OCR2 is now reading the text in the cropped rectangle flawlessly.

      Thanks for pointing me in the right direction.

      Here is the sample code:

      use Image::Magick;
      use PDF::OCR2;

      my $image = Image::Magick->new;

      # Rasterize at 300 dpi rather than the 72 dpi default
      $image->Set(density => '300x300');
      $image->Read('agendas/2016-02-02 Natural Resources.pdf', compression => 'None');

      # Keep only the rectangle of interest (WIDTHxHEIGHT+X+Y)
      $image->Crop(geometry => '1248x520+936+520');
      $image->Write(filename => 'crop.pdf', compression => 'None');

      # OCR the cropped file
      my $p        = PDF::OCR2->new('crop.pdf');
      my $text_all = $p->text;
      print $text_all;
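One thing to watch when changing the density: the crop geometry is in pixels, so a rectangle measured on a 72 dpi rendering has to be scaled up before it is used at 300 dpi. Here is a rough sketch of that arithmetic. Note that scale_geometry is a hypothetical helper of my own, not something provided by Image::Magick or PDF::OCR2, and it assumes the simple "WxH+X+Y" form with non-negative offsets.

```perl
use strict;
use warnings;

# Hypothetical helper: rescale a "WxH+X+Y" crop geometry from one
# density (dpi) to another. Only the plain WxH+X+Y form is handled.
sub scale_geometry {
    my ($geometry, $from_dpi, $to_dpi) = @_;

    my ($w, $h, $x, $y) = $geometry =~ /^(\d+)x(\d+)\+(\d+)\+(\d+)$/
        or die "Unsupported geometry: $geometry";

    # Scale each value and round to the nearest whole pixel.
    my ($sw, $sh, $sx, $sy) =
        map { int($_ * $to_dpi / $from_dpi + 0.5) } $w, $h, $x, $y;

    return "${sw}x${sh}+${sx}+${sy}";
}

# A rectangle measured at 72 dpi, rescaled for a 300 dpi read.
print scale_geometry('288x120+216+120', 72, 300), "\n";   # prints 1200x500+900+500
```

The result could then be passed as the geometry argument to Crop in the code above, in place of a hard-coded string.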

      Thanks, yeah, I'm getting very close now. I'm at least getting some usable output after using Image::Magick to crop the PDF. The only problem is that ImageMagick seems to read the image in at very low quality, so the OCR results are suboptimal. Hopefully there is some setting I can use to address this.
