Extracting content text from PDFs

pat_mc has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
Re: Extracting content text from PDFs by marto (Cardinal) on Apr 01, 2008 at 15:48 UTC
Hi pat_mc, If I were you I would try CAM::PDF, using getpdftext.pl as a starting point. I have had great success with it. Recently, however, a couple of problematic PDFs were made available to me for testing, and this method had problems with them. Since this was the first time that extracting text from PDFs this way had given me trouble, I still suggest at least trying CAM::PDF. If the PDFs in question are made up of images (I have seen PDFs where each page is actually an embedded TIFF resulting from a scanning process) you will need to OCR each page; see "Re: parse content of PDF file" for further details. You may also want to read "Extracting text from PDF. No really" on this subject. Also remember that Super Search is your friend; this question seems to be asked frequently. Hope this helps, Martin
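A minimal sketch of the CAM::PDF approach suggested above, not the getpdftext.pl script itself. It assumes CAM::PDF is installed and uses its `new`, `numPages` and `getPageText` methods; the helper name `page_texts` is made up for illustration.

```perl
use strict;
use warnings;
use CAM::PDF;

# Return a list holding the extracted text of each page
# (hypothetical helper, not part of CAM::PDF).
sub page_texts {
    my ($file) = @_;
    my $pdf = CAM::PDF->new($file)
        or die "Cannot open $file: $CAM::PDF::errstr\n";
    my @texts;
    for my $page (1 .. $pdf->numPages()) {
        my $text = $pdf->getPageText($page);
        # Scanned, image-only pages typically yield no text here.
        push @texts, defined $text ? $text : '';
    }
    return @texts;
}

# Only run when a file name is given on the command line.
if (my $file = shift @ARGV) {
    my $n = 0;
    for my $text (page_texts($file)) {
        $n++;
        print "--- page $n ---\n$text\n";
    }
}
```

If every page comes back empty, the PDF is probably a set of scanned images and needs OCR instead.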
Re^2: Extracting content text from PDFs by pat_mc (Pilgrim) on Sep 12, 2008 at 14:07 UTC
marto - Thanks for your extremely helpful post ... and apologies for not having responded to it any earlier. My experience was exactly the one clinton describes in the thread you reference: modules like `CAM-PDF` only produce mildly helpful output. I am very grateful for the reference to the Linux tool `pdftotext`. With the option `-htmlmeta` it produces extremely useful, tagged output from a given PDF. This is precisely what I have been looking for for a long time. I will intensify my efforts related to this utility from now on. Thanks again! Pat
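One way to drive `pdftotext -htmlmeta` from Perl could look like the sketch below. It assumes a `pdftotext` binary is on the PATH and that `-` as the output file sends the result to stdout; the helper names `pdftotext_cmd` and `pdf_to_tagged_text` are made up for illustration.

```perl
use strict;
use warnings;

# Build the argument list for pdftotext. -htmlmeta is the option named in
# the post above; "-" as the output file asks for output on stdout.
sub pdftotext_cmd {
    my ($pdf_file) = @_;
    return ('pdftotext', '-htmlmeta', $pdf_file, '-');
}

# Run pdftotext and return everything it prints as one string.
sub pdf_to_tagged_text {
    my ($pdf_file) = @_;
    # List-form pipe open bypasses the shell, so odd file names are safe.
    open my $fh, '-|', pdftotext_cmd($pdf_file)
        or die "Cannot run pdftotext: $!\n";
    local $/;    # slurp mode: read the whole output at once
    my $out = <$fh>;
    close $fh or die "pdftotext failed (exit status $?)\n";
    return $out;
}

# Only run when a file name is given on the command line.
if (my $file = shift @ARGV) {
    print pdf_to_tagged_text($file);
}
```

The returned string can then be handed to whatever HTML parser you already use.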
Re: Extracting content text from PDFs by leocharre (Priest) on Apr 01, 2008 at 16:34 UTC
Funny.. I was just updating PDF::OCR. Let me update the package first. I have PDF::GetImage and Image::OCR::Tesseract to update, then PDF::OCR. It's pretty well tested; I use it a lot at work. If other people were to use it, I could get technical feedback to make it better. I also have an indexer that records all text content, and an interface to search it. Thus, you can have a million docs scanned in and search their text content; it then tells you the file location, the page, and the line number. That part is a little more complex, because indexing has to be done in parallel on multiple CPUs, otherwise it would take 30 days for 60k docs. Update: PDF-GetImages-1.07, Image-OCR-Tesseract-1.10, PDF-OCR-1.06. Make sure to see the README; it has other notes and help with installing tesseract. I suggest you check out the packages individually instead of using cpan. The whole thing works like a marvel. Take a look at the INSTALL help file; if you need help, just email me per the instructions.
Re: Extracting content text from PDFs by alexm (Chaplain) on Apr 01, 2008 at 16:11 UTC
Chris Dolan, the author of CAM-PDF, also has a nice interface to mount a PDF in your filesystem: Fuse-PDF. See the module announced on use.perl.
Re: Extracting content text from PDFs by traveler (Parson) on Apr 01, 2008 at 18:52 UTC
PDF::API2 has a nice little hash with the document info. That makes it easy to put into a database or use otherwise. I've used it with great success to get info similar to what you are planning. HTH, --traveler
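A minimal sketch of reading that info hash, assuming PDF::API2 is installed and that its `info` method returns the document information as key/value pairs (newer releases also provide `info_metadata`). The helper name `pdf_info` is made up for illustration. Note that this gives metadata such as Title or Author, not the page text.

```perl
use strict;
use warnings;
use PDF::API2;

# Return the document information dictionary as a plain hash
# (hypothetical helper, not part of PDF::API2).
sub pdf_info {
    my ($file) = @_;
    my $pdf  = PDF::API2->open($file);    # dies if the file cannot be read
    my %info = $pdf->info();
    return %info;
}

# Only run when a file name is given on the command line.
if (my $file = shift @ARGV) {
    my %info = pdf_info($file);
    for my $key (sort keys %info) {
        print "$key: $info{$key}\n";
    }
}
```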
Re^2: Extracting content text from PDFs by pat_mc (Pilgrim) on Apr 04, 2008 at 10:06 UTC
Hi, traveler - Thanks for your suggestion. I have tried the module you suggest ... but unfortunately to no avail. Apart from the fact that it only extracted a fraction of the relevant document information, its main drawback was that the `stringify` method only produced a load of gibberish that flickered across my screen with plenty of beeps. Any idea why this is? I also wonder what limitations this module is subject to regarding how the PDF was generated. Can it only handle PDFs which were generated by a certain application or with certain parameters? Thanks for your help nonetheless and cheers from Hamburg - Pat
Re^3: Extracting content text from PDFs by traveler (Parson) on Apr 15, 2008 at 16:40 UTC
If there are limits to which PDFs work and which don't, I have not run into them :) I have not seen stringify send garbage to the output unless I tried to display a picture. For real text, it seemed to work just fine. I have no idea about those problems, as it has worked for the uses to which I have put it. Sorry.
Re^4: Extracting content text from PDFs by almut (Canon) on Apr 16, 2008 at 03:32 UTC