Re: Parse PDF to text

Hi.

We use the "poppler" library (http://poppler.freedesktop.org/) to extract the text from PDFs (several hundred per day), with generally very good results. You still have to post-process the resulting text to extract what you want, though.
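For illustration only, here is a minimal sketch of one way to drive poppler from a script, assuming its "pdftotext" command-line tool is installed and on the PATH. This is not necessarily how our own setup works; the function names (pdftotext_command, extract_text, split_pages) are made up for the example.

```python
import subprocess


def pdftotext_command(pdf_path, layout=True):
    """Build the argument list for poppler's pdftotext CLI (hypothetical helper)."""
    cmd = ["pdftotext"]
    if layout:
        # -layout tries to keep the physical layout of the page
        cmd.append("-layout")
    # "-" as the output file name sends the text to stdout
    cmd += ["-enc", "UTF-8", pdf_path, "-"]
    return cmd


def extract_text(pdf_path, layout=True):
    """Run pdftotext and return the extracted text as a string."""
    result = subprocess.run(
        pdftotext_command(pdf_path, layout),
        capture_output=True,
        check=True,  # raise CalledProcessError if pdftotext fails
    )
    return result.stdout.decode("utf-8", errors="replace")


def split_pages(text):
    """Split pdftotext output into pages; pdftotext ends each page with a form feed."""
    pages = text.split("\f")
    # Drop the empty piece left after the final form feed
    if pages and pages[-1].strip() == "":
        pages.pop()
    return pages
```

Something like split_pages(extract_text("report.pdf")) would then give one string per page ("report.pdf" is just a placeholder name), which you still have to parse for whatever you are after.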

But you should be aware that not all PDFs "are" text. Many documents that are presented as PDFs and look like text are in fact scanned images of text, embedded in a PDF. There can also be a mixture of real text and images of text in the same PDF. None of the "PDF text extractors" will help you with those; the only real way to deal with them is to convert them back to images and run OCR on them.
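As a rough sketch of that fallback, assuming poppler's "pdftoppm" and the "tesseract" OCR program are both installed: flag the pages whose extracted text is (almost) empty, render each of them to an image, and OCR it. The 20-character threshold and the function names are arbitrary assumptions for the example, and the heuristic will not catch a page that mixes real text with an image of text.

```python
import subprocess
import tempfile
from pathlib import Path

MIN_CHARS = 20  # assumed threshold; tune it for your own documents


def pages_needing_ocr(pages, min_chars=MIN_CHARS):
    """Return 1-based numbers of pages whose extracted text is nearly empty.

    `pages` is a list of per-page strings, for example text extractor
    output split on form feeds.
    """
    return [
        number
        for number, page in enumerate(pages, start=1)
        if len("".join(page.split())) < min_chars  # count non-whitespace chars
    ]


def ocr_page(pdf_path, page_number, dpi=300):
    """Render one PDF page to PNG with pdftoppm, then OCR it with tesseract."""
    with tempfile.TemporaryDirectory() as tmp:
        prefix = str(Path(tmp) / "page")
        # -singlefile writes exactly <prefix>.png for the selected page
        subprocess.run(
            ["pdftoppm", "-r", str(dpi), "-png", "-singlefile",
             "-f", str(page_number), "-l", str(page_number),
             pdf_path, prefix],
            check=True,
        )
        # "stdout" as the output base makes tesseract print the recognized text
        result = subprocess.run(
            ["tesseract", prefix + ".png", "stdout"],
            capture_output=True,
            check=True,
        )
        return result.stdout.decode("utf-8", errors="replace")
```

OCR is much slower and less accurate than real text extraction, so it makes sense to run it only on the pages that pages_needing_ocr flags rather than on the whole document.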
