Reading the documentation of PDF::OCR2, I get the impression that it converts the PDF pages into separate image files using PDF::GetImages and then uses Image::OCR::Tesseract to get the text from the image.

I would change that to add a cropping step in between, which selects only the "interesting" part of the image.

Re^4: PDF::OCR2 results not what I was hoping for
by nysus (Parson) on Feb 08, 2016 at 18:35 UTC

    Bam! Got it. I set the "density" setting to "300x300" before reading the image in; by default it is 72 dpi.

    PDF::OCR2 is now reading the text in the cropped rectangle flawlessly.

    Thanks for pointing me in the right direction.

    Here is the sample code:

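    use Image::Magick;
    use PDF::OCR2;

    my $image = Image::Magick->new;

    # Set the density before reading, so the PDF is rasterized at 300 dpi
    # instead of the 72 dpi default.
    $image->Set(density=>'300x300');
    $image->Read('agendas/2016-02-02 Natural Resources.pdf', compression=>'None');

    # Keep only the interesting rectangle (width x height + x offset + y offset, in pixels).
    $image->Crop(geometry=>'1248x520+936+520');
    $image->Write(filename=>'crop.pdf', compression=>'None');

    # OCR the cropped file.
    my $p = PDF::OCR2->new('crop.pdf');
    my $text_all = $p->text;
    print $text_all;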
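One thing to watch when changing the density: the crop geometry is given in pixels, so a rectangle worked out at 72 dpi no longer covers the same area at 300 dpi. Below is a minimal sketch of a helper that builds the geometry string from measurements in inches. The name `crop_geometry` and the sample rectangle are made up for illustration; they are not part of Image::Magick or PDF::OCR2.

```perl
use strict;
use warnings;

# Hypothetical helper: build an ImageMagick geometry string (WxH+X+Y, in
# pixels) from a crop rectangle given in inches and the density (dpi)
# used to rasterize the page.
sub crop_geometry {
    my ($dpi, $width_in, $height_in, $left_in, $top_in) = @_;

    # pixels = inches * dpi, rounded to the nearest whole pixel
    my @px = map { sprintf '%.0f', $_ * $dpi }
             $width_in, $height_in, $left_in, $top_in;

    return sprintf '%dx%d+%d+%d', @px;
}

# Illustrative rectangle: 4 in wide, 2 in tall, 3 in from the left edge,
# 1.5 in from the top edge.
print crop_geometry(72,  4, 2, 3, 1.5), "\n";   # 288x144+216+108
print crop_geometry(300, 4, 2, 3, 1.5), "\n";   # 1200x600+900+450
```

The resulting string could then be passed as the `geometry` argument to `Crop`, using the same dpi value that was given to `Set(density=>...)`.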

    $PM = "Perl Monk's";
    $MCF = "Most Clueless Friar Abbot Bishop Pontiff Deacon";
    $nysus = $PM . $MCF;
    Click here if you love Perl Monks

Re^4: PDF::OCR2 results not what I was hoping for
by nysus (Parson) on Feb 08, 2016 at 18:17 UTC
    Thanks, yeah, I'm getting very close now. I'm at least getting some usable output after using Image::Magick to crop the PDF. The only problem is that ImageMagick seems to read the image in at very low quality, so the OCR results are suboptimal. Hopefully there is some setting I can use to address this.
