in reply to PDF::OCR2 results not what I was hoping for

If you're trying OCR on a form, I think the best approach is to pre-segment the different areas where text appears. I found multi-column (or in your case, even multi-box) text to be highly confusing for the OCR programs I tried.

Since what you have is basically a form with more or less fixed offsets, I would try to extract the rectangles in which the date, time, and location appear and then run OCR on just those regions. Also look into the settings of your OCR program to find out whether you can specify a sans-serif font.
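To make the "fixed offsets" idea concrete, here is a minimal sketch that turns a table of field rectangles into ImageMagick-style geometry strings. The field names, the coordinates, and the helper `geometry_for` are all made up for illustration; you would have to measure the real rectangles on your own form.

```perl
use strict;
use warnings;

# Hypothetical field layout: x, y, width, height in PDF points (1/72 inch),
# measured from the top-left corner of the page. These numbers are invented.
my %fields = (
    date     => { x => 72, y => 90,  w => 144, h => 18 },
    time     => { x => 72, y => 112, w => 144, h => 18 },
    location => { x => 72, y => 134, w => 288, h => 36 },
);

# Convert a rectangle in points into a geometry string (WIDTHxHEIGHT+X+Y)
# in pixels for a page rendered at $dpi, rounding to the nearest pixel.
sub geometry_for {
    my ($rect, $dpi) = @_;
    my $scale = $dpi / 72;
    my ($w, $h, $x, $y) = map { int($rect->{$_} * $scale + 0.5) } qw(w h x y);
    return "${w}x${h}+${x}+${y}";
}

# Print one crop rectangle per field for a 300 dpi rendering.
for my $name (sort keys %fields) {
    printf "%-8s %s\n", $name, geometry_for($fields{$name}, 300);
}
```

Each resulting string can then be used as the crop rectangle for one field, so the OCR program only ever sees a single box of text at a time.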

Replies are listed 'Best First'.
Re^2: PDF::OCR2 results not what I was hoping for
by nysus (Parson) on Feb 08, 2016 at 16:55 UTC
    I didn't see anything in the PDF::OCR2 documentation that allowed you to just scan a portion of the document. How would I do this?

    $PM = "Perl Monk's";
    $MCF = "Most Clueless Friar Abbot Bishop Pontiff Deacon";
    $nysus = $PM . $MCF;
    Click here if you love Perl Monks

      Reading the documentation of PDF::OCR2, I get the impression that it converts the PDF pages into separate image files using PDF::GetImages and then uses Image::OCR::Tesseract to get the text from the image.

      I would change that to add a cropping step in between, which selects only the "interesting" part of the image.
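A rough sketch of what such a cropping step could look like, assuming the page has already been extracted to an image file (for example by PDF::GetImages, as described above). The sub name `ocr_region` is hypothetical, and I am assuming that Image::OCR::Tesseract provides `get_ocr` as shown in its synopsis; this is untested against PDF::OCR2's internals.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Crop one region out of a page image and OCR only that region.
# $geometry is an ImageMagick geometry string, e.g. '600x75+300+375'.
sub ocr_region {
    my ($image_path, $geometry) = @_;

    # Loaded at run time so this file compiles even without the modules.
    require Image::Magick;
    require Image::OCR::Tesseract;

    my $img = Image::Magick->new;
    my $err = $img->Read($image_path);
    die "Read failed: $err" if "$err";

    $err = $img->Crop(geometry => $geometry);
    die "Crop failed: $err" if "$err";
    $img->Set(page => '0x0+0+0');    # reset the virtual canvas left by the crop

    # Write the cropped region to a temporary PNG that is removed at exit.
    my (undef, $cropped) = tempfile(SUFFIX => '.png', UNLINK => 1);
    $err = $img->Write(filename => $cropped);
    die "Write failed: $err" if "$err";

    return Image::OCR::Tesseract::get_ocr($cropped);
}

# Usage: perl ocr_region.pl page.png 600x75+300+375
if (@ARGV == 2) {
    print ocr_region(@ARGV), "\n";
}
```

Writing the crop to a temporary file keeps the OCR step unchanged: it still receives an ordinary image path, just a smaller one.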

        Bam! Got it. I set the "density" setting to "300x300" when reading the image in; by default it is 72 dpi.

        PDF::OCR2 is now reading the text in the cropped rectangle flawlessly.

        Thanks for pointing me in the right direction.

        Here is the sample code:

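        use strict;
        use warnings;
        use Image::Magick;
        use PDF::OCR2;

        my $image = Image::Magick->new;

        # Render the PDF at 300 dpi instead of the default 72 dpi
        $image->Set(density => '300x300');
        $image->Read('agendas/2016-02-02 Natural Resources.pdf', compression => 'None');

        # Keep only the 1248x520 pixel rectangle whose top-left corner is at (936, 520)
        $image->Crop(geometry => '1248x520+936+520');
        $image->Write(filename => 'crop.pdf', compression => 'None');

        # OCR the cropped PDF
        my $p = PDF::OCR2->new('crop.pdf');
        my $text_all = $p->text;
        print $text_all;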
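One thing to keep in mind with the density setting: crop geometry is given in pixels, so a rectangle measured at one density has to be rescaled for another. A small sketch of that arithmetic follows; the helper `rescale_geometry` is a made-up name, and it only handles the plain WIDTHxHEIGHT+X+Y form used above.

```perl
use strict;
use warnings;

# Rescale a geometry string (WIDTHxHEIGHT+X+Y) measured at one density
# to the equivalent pixel rectangle at another density.
sub rescale_geometry {
    my ($geometry, $from_dpi, $to_dpi) = @_;
    my ($w, $h, $x, $y) = $geometry =~ /^(\d+)x(\d+)\+(\d+)\+(\d+)$/
        or die "Unsupported geometry: $geometry";
    my $scale = $to_dpi / $from_dpi;
    # Round each value to the nearest whole pixel.
    ($w, $h, $x, $y) = map { int($_ * $scale + 0.5) } ($w, $h, $x, $y);
    return "${w}x${h}+${x}+${y}";
}

# A US Letter page (8.5 x 11 inches) is 612x792 pixels at 72 dpi.
print rescale_geometry('612x792+0+0', 72, 300), "\n";

# The 300 dpi crop rectangle from the sample code, expressed at 72 dpi.
print rescale_geometry('1248x520+936+520', 300, 72), "\n";
```

At 300 dpi the same rectangle has roughly 17 times as many pixels as at 72 dpi, which is presumably why the OCR results improved so much.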

        Thanks, yeah, I'm getting very close now. I'm at least getting some usable output after using Image::Magick to crop the PDF. The only problem is that ImageMagick seems to read the image in at very low quality, so the OCR results are suboptimal. Hopefully there is some setting I can use to address this.

Re^2: PDF::OCR2 results not what I was hoping for
by nysus (Parson) on Feb 08, 2016 at 17:03 UTC
    Maybe I could use ImageMagick to crop the PDF?
