LanX has asked for the wisdom of the Perl Monks concerning the following question:

Hi

Image::Magick has so many features that I'm not sure whether I've missed the one I need.

I want to cut out segments from a scanned text containing tabular data; the "borders" are just white.

Unfortunately, the quality is too poor for OCR (tesseract fails), and the font size, line height and table start vary.

I know how I would approach it algorithmically: count pixels in rows and columns and guess the table's geometry from a histogram.
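The counting idea can be sketched in plain Perl. This is only an illustration: it assumes the scan has already been thresholded into an array of rows holding 1 for ink and 0 for white, and the names `projection_profiles` and `ink_bands` are made up for this example.

```perl
use strict;
use warnings;

# Sum the ink pixels of every row and every column.
# $pixels is a reference to a rectangular array of rows (1 = ink, 0 = white).
sub projection_profiles {
    my ($pixels) = @_;
    my @row_sum = (0) x @$pixels;
    my @col_sum = (0) x @{ $pixels->[0] || [] };
    for my $y (0 .. $#$pixels) {
        my $row = $pixels->[$y];
        for my $x (0 .. $#$row) {
            $row_sum[$y] += $row->[$x];
            $col_sum[$x] += $row->[$x];
        }
    }
    return (\@row_sum, \@col_sum);
}

# Turn a profile into [start, end] index ranges whose count exceeds
# $threshold, i.e. the bands containing ink, separated by white gaps.
sub ink_bands {
    my ($profile, $threshold) = @_;
    my (@bands, $start);
    for my $i (0 .. $#$profile) {
        if ($profile->[$i] > $threshold) {
            $start = $i unless defined $start;
        }
        elsif (defined $start) {
            push @bands, [$start, $i - 1];
            undef $start;
        }
    }
    push @bands, [$start, $#$profile] if defined $start;
    return @bands;
}

# Toy "scan": two text rows, two columns, white borders in between.
my @pixels = (
    [0, 0, 0, 0, 0, 0, 0],
    [0, 1, 1, 0, 1, 1, 0],
    [0, 1, 1, 0, 1, 0, 0],
    [0, 0, 0, 0, 0, 0, 0],
    [0, 1, 0, 0, 0, 1, 0],
    [0, 0, 0, 0, 0, 0, 0],
);
my ($rows, $cols) = projection_profiles(\@pixels);
my @row_bands = ink_bands($rows, 0);
my @col_bands = ink_bands($cols, 0);
print "rows: ", join(" ", map { "$_->[0]-$_->[1]" } @row_bands), "\n";
print "cols: ", join(" ", map { "$_->[0]-$_->[1]" } @col_bands), "\n";
```

On a noisy scan the threshold would probably have to be larger than 0, so that a few stray specks do not bridge a white gap.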

I'm sure Image::Magick has more elaborate ways to get there ... any suggestions?

TIA! :)

Cheers Rolf
(addicted to the Perl Programming Language and ☆☆☆☆ :)
Je suis Charlie!

Replies are listed 'Best First'.
Re: Segmenting an image into table cells?
by Anonymous Monk on Nov 03, 2015 at 02:12 UTC
Re: Segmenting an image into table cells?
by LanX (Saint) on Nov 30, 2015 at 00:04 UTC
    Answering my own question:

    Image::Magick lets you crop out rows and columns; trimming the result then shows whether they are "empty".

    $x = $p->Crop(geometry=>'439x4+166+167');
    warn "$x" if "$x";
    $x = $p->Trim();
    warn "$x" if "$x";
    print join ",", $p->Get(qw/columns rows page/);

    output

    Exception 310: geometry does not contain image `216.PNG' @ warning/attribute.c/GetImageBoundingBox/239 at /tmp/tst_im.pl line 36.
    1,1,1238x1688-1-1

    So a clever combination of crops and trims can quite effectively dissect a table into its cells.

    Of course one needs reasonable assumptions about the geometry.
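One way such a crop-and-trim scan might look, as a rough sketch and not the script used above: `strip_is_blank` and `blank_row_offsets` are hypothetical helper names, and the code assumes that trimming a uniformly white strip returns the "geometry does not contain image" warning seen in the output, which may differ between ImageMagick versions.

```perl
use strict;
use warnings;

# True if the strip described by $geometry (e.g. '439x4+166+167') trims away
# to nothing. $page is assumed to be an Image::Magick object that has already
# been Read(); it is cloned so the original page stays untouched.
sub strip_is_blank {
    my ($page, $geometry) = @_;
    my $strip = $page->Clone();
    my $err = $strip->Crop(geometry => $geometry);
    die "Crop failed: $err" if "$err";
    $err = $strip->Trim();
    # Assumption: a uniform strip makes Trim() return this warning.
    return "$err" =~ /geometry does not contain image/ ? 1 : 0;
}

# Slide a full-width strip of $step pixel rows down the page and collect
# the y offsets of all strips that look blank.
sub blank_row_offsets {
    my ($page, $step) = @_;
    my ($width, $height) = $page->Get(qw/columns rows/);
    my @blank;
    for (my $y = 0; $y + $step <= $height; $y += $step) {
        push @blank, $y if strip_is_blank($page, "${width}x${step}+0+${y}");
    }
    return @blank;
}
```

Called as `blank_row_offsets($p, 4)` on a page object like `$p` above, this would list the 4-pixel bands that are candidates for row borders; the same loop over x offsets with full-height strips would give the column borders. Consecutive blank offsets still need to be merged into gaps, which is where the assumptions about the geometry come in.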

    Cheers Rolf
    (addicted to the Perl Programming Language and ☆☆☆☆ :)
    Je suis Charlie!