I had good results (though only for a couple of pages) with the OCR approach to extracting text from PDF. I was impressed that it also worked relatively well for equations, extracting them as LaTeX. I used a demo copy of a commercial program called InftyReader (run on Linux via Wine); the demo allows only 5 pages of text per day, but you may want to test it yourself, as your mileage may vary. I only had 2 pages to do, and it was a very high-quality PDF document produced by LaTeX whose source we had lost.

For setting up your own OCR engine there is Tesseract, and there are Perl modules (e.g. Image::OCR::Tesseract) to interact with it. Or you may prefer to interface with it through OpenCV (C++), which also gives you access to its vast library of image-processing algorithms for de-noising etc.

I have not done this myself on a large scale, only to play around, and that was a few years back. I remember it being "difficult" to set up. It would be interesting to see if it works for you.
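For a first test, driving the tesseract command-line tool from a script is probably the least painful route. Below is a minimal sketch of that, assuming Tesseract 4 or later is installed and on your PATH (older 3.x releases spell the page-segmentation option differently). It is in Python only for brevity; the function names and the default values are mine, not anything Tesseract prescribes.

```python
import subprocess


def build_tesseract_cmd(image_path, lang="eng", psm=6):
    """Build the argument list for one tesseract run.

    'stdout' as the output base makes tesseract print the recognised
    text instead of writing a .txt file. lang and psm defaults are
    illustrative assumptions; tune them for your documents.
    """
    return ["tesseract", image_path, "stdout", "-l", lang, "--psm", str(psm)]


def ocr_image(image_path, lang="eng", psm=6):
    """Run tesseract on one image and return the recognised text."""
    result = subprocess.run(
        build_tesseract_cmd(image_path, lang, psm),
        capture_output=True,
        text=True,
        check=True,  # raise if tesseract exits with an error
    )
    return result.stdout
```

You would first render each PDF page to an image (e.g. with pdftoppm) and then call ocr_image() on each resulting file.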

The important thing with Tesseract is that it can be trained on samples of your own text. So, if your text volume is large enough to justify the investment and is relatively consistent in fonts and layout, you may be lucky and end up with something that achieves better than 90% accuracy.
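To know whether you are anywhere near that figure, you need a few pages typed out by hand to compare against. Here is a rough sketch of such a check, using the similarity ratio from Python's standard difflib module. Note that this ratio is only a stand-in for a proper character error rate, and the function name is my own.

```python
import difflib


def char_accuracy(ocr_text, truth_text):
    """Return a 0..1 similarity between OCR output and hand-typed truth.

    This is difflib's ratio (2 * matching chars / total chars), used
    here as a rough proxy for OCR accuracy, not a strict error rate.
    """
    matcher = difflib.SequenceMatcher(None, truth_text, ocr_text, autojunk=False)
    return matcher.ratio()
```

Running this over a handful of sample pages before and after training gives a quick idea of whether the training effort is paying off.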

Update: in the case of color-highlighted text, OCR should work very well, because you can pre-process the image and separate the text by color, or even by font and its attributes (bold or italic). This means that combining the OCR approach with the source-code-reversal approach we usually try with pdfto* will give you extra power.
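The color separation can be sketched without any imaging library at all. The toy example below treats a page as a grid of (r, g, b) tuples, marks the pixels close to an assumed highlight color, and returns the box enclosing them, which is the region you would crop and pass to the OCR engine. The yellow target, the tolerance, and the function names are all assumptions of mine; on real scans you would do the same thing with OpenCV or similar.

```python
def highlight_mask(pixels, target=(255, 255, 0), tol=60):
    """Mark pixels whose every channel is within tol of the target color.

    pixels is a list of rows, each row a list of (r, g, b) tuples.
    The yellow target and the tolerance are illustrative guesses.
    """
    return [
        [all(abs(c - t) <= tol for c, t in zip(px, target)) for px in row]
        for row in pixels
    ]


def bounding_box(mask):
    """Return (top, left, bottom, right) enclosing all marked pixels, or None."""
    rows = [y for y, row in enumerate(mask) if any(row)]
    if not rows:
        return None
    cols = [x for row in mask for x, hit in enumerate(row) if hit]
    return (min(rows), min(cols), max(rows), max(cols))
```

The dark glyph pixels inside a highlighted area do not match the highlight color themselves, which is why the enclosing box is cropped rather than the mask being used directly.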

bw, bliako


In reply to Re: Read highlighted text from PDF by bliako
in thread Read highlighted text from PDF by IB2017
