I had good results (for a couple of pages only) with the OCR approach to extracting text from PDF. I was impressed that it also worked relatively well for equations, extracting them as LaTeX. I used a demo copy of a commercial program called InftyReader (run on Linux via Wine); the demo allows only 5 pages of text per day. Your mileage may vary, though: I only had 2 pages to do, and they came from a very high quality PDF document produced by LaTeX whose source we had lost.
To set up your own OCR engine there is Tesseract, and there are Perl modules (e.g. Image::OCR::Tesseract) to interact with it. Or you may prefer to use it from C++ together with OpenCV, which also gives you access to OpenCV's vast library of image processing algorithms for de-noising etc.
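As a rough illustration of what talking to Tesseract involves, here is a minimal sketch that shells out to the `tesseract` command-line program. It is written in Python only for brevity; it assumes the `tesseract` binary is installed and on your PATH, and the function names and the file name `page.png` are made up for the example. Using `stdout` as the output base makes Tesseract print the recognised text instead of writing a file, and `-l` selects the language data.

```python
import subprocess


def build_tesseract_cmd(image_path, lang="eng"):
    """Build the argument list for a basic Tesseract run.

    'stdout' as the output base tells tesseract to print the text
    rather than write <outputbase>.txt; '-l' picks the language data.
    """
    return ["tesseract", image_path, "stdout", "-l", lang]


def ocr_image(image_path, lang="eng"):
    """Run tesseract on one image and return the recognised text.

    Assumes the tesseract binary is installed; raises
    subprocess.CalledProcessError if it exits with an error.
    """
    result = subprocess.run(
        build_tesseract_cmd(image_path, lang),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout
```

Calling `ocr_image("page.png")` would return whatever text Tesseract recognises on that page image. The PDF pages have to be rendered to images first, which this sketch does not cover.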
I have not used Tesseract on a large scale myself, only to play with, and that was a few years back. I remember it was "difficult" to set up. It would be interesting to see if it works for you.
The important thing about Tesseract is that it can be trained on samples of your own text. So, if your text volume is large enough to justify the investment and is relatively consistent in fonts and layout, you may be lucky and end up with something that works with better than 90% success.
Update: in the case of color-highlighted text, OCR should work very well, because you can pre-process the image and separate text by color, or even by font and its attributes (bold or italic). This means that combining the OCR approach with the source-code-reversal approach we usually try with pdfto* will give you extra power.
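To make the color pre-processing idea concrete, here is a minimal, dependency-free sketch. It assumes the page has already been rendered to an image held as rows of `(r, g, b)` tuples, that the highlight is roughly pure yellow, and that a per-channel tolerance of 60 is good enough; all of these, along with the function names, are assumptions for illustration. In practice you would do the same thing with OpenCV or a similar library on real image data.

```python
def is_highlight(pixel, target=(255, 255, 0), tol=60):
    """True if every channel of pixel is within tol of the target color."""
    return all(abs(c - t) <= tol for c, t in zip(pixel, target))


def highlight_bbox(pixels, target=(255, 255, 0), tol=60):
    """Return (top, left, bottom, right) enclosing all highlight pixels.

    Returns None when no pixel matches. The box also covers the dark
    text pixels sitting inside the highlighted area.
    """
    rows, cols = [], []
    for y, row in enumerate(pixels):
        for x, p in enumerate(row):
            if is_highlight(p, target, tol):
                rows.append(y)
                cols.append(x)
    if not rows:
        return None
    return (min(rows), min(cols), max(rows), max(cols))


def crop(pixels, bbox):
    """Cut the bounding box out of the image (bounds are inclusive)."""
    top, left, bottom, right = bbox
    return [row[left:right + 1] for row in pixels[top:bottom + 1]]
```

The cropped region is what you would then hand to the OCR engine, so that only the highlighted text gets recognised. A single bounding box is the simplest possible choice; a page with several separate highlights would need one box per connected region.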
bw, bliako
In reply to Re: Read highlighted text from PDF
by bliako
in thread Read highlighted text from PDF
by IB2017