Re: Extracting content text from PDFs

Hi pat_mc,

If I were you, I would try CAM::PDF, using its getpdftext.pl script as a starting point. I have had great success with it. Recently, however, I was given a couple of problematic PDFs to test, and this method had trouble with them. Since that was the first time extracting text this way had given me problems, I still suggest at least trying CAM::PDF.

If the PDFs in question are made up of images (I have seen PDFs where each page is actually an embedded TIFF produced by a scanning process), you will need to OCR each page. See Re: parse content of PDF file for further details.
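To make the two points above concrete, here is a minimal sketch, not a replacement for getpdftext.pl. It assumes CAM::PDF is installed and uses its `new`, `numPages` and `getPageText` methods. The helper names `has_text` and `extract_pages` are made up for this example, and treating a page with no extractable text as a probable scan is only a heuristic of mine.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Return 1 if a page's extracted text contains anything besides whitespace.
sub has_text {
    my ($text) = @_;
    my $ok = defined($text) && $text =~ /\S/;
    return $ok ? 1 : 0;
}

# Extract text page by page; returns a list of [page_number, text] pairs.
sub extract_pages {
    my ($file) = @_;
    no warnings 'once';    # $CAM::PDF::errstr is only mentioned once here
    require CAM::PDF;      # loaded lazily, so has_text() works without the module
    my $pdf = CAM::PDF->new($file)
        or die "Cannot open $file: $CAM::PDF::errstr\n";
    return map { [ $_, $pdf->getPageText($_) ] } 1 .. $pdf->numPages();
}

# Only runs when a file name is given on the command line.
if (@ARGV) {
    for my $page ( extract_pages( $ARGV[0] ) ) {
        my ( $num, $text ) = @$page;
        if ( has_text($text) ) {
            print "--- page $num ---\n$text\n";
        }
        else {
            # Heuristic (assumption): no text usually means a scanned image.
            warn "page $num: no text found, possibly a scan that needs OCR\n";
        }
    }
}
```

Run it as `perl extract.pl some.pdf` (the script name is arbitrary). Pages that trigger the warning are the ones worth sending through OCR.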

You may want to read "Extracting text from PDF. No really" on this subject. Also remember that Super Search is your friend; this question seems to be asked frequently.

Hope this helps

Martin

Replies are listed 'Best First'.
Re^2: Extracting content text from PDFs by pat_mc (Pilgrim) on Sep 12, 2008 at 14:07 UTC
marto - Thanks for your extremely helpful post, and apologies for not responding to it sooner. My experience was exactly the one clinton describes in the thread you reference: modules like `CAM-PDF` produce only mildly helpful output. I am very grateful for the reference to the Linux tool `pdftotext`. With the option `-htmlmeta` it produces extremely useful, tagged output from a given PDF. This is precisely what I have been looking for for a long time. I will concentrate my efforts on this utility from now on. Thanks again! Pat
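
For anyone who wants to drive `pdftotext` from Perl, a rough sketch follows. It assumes the `pdftotext` binary (from Xpdf or Poppler) is on your PATH and relies on its convention that `-` as the output name writes to standard output. The sub names `pdftotext_command` and `pdf_to_html` are invented for this example.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Build the argument list; '-' as the output name sends the result to STDOUT.
sub pdftotext_command {
    my ($file) = @_;
    return ( 'pdftotext', '-htmlmeta', $file, '-' );
}

# Run pdftotext and return everything it prints as one string.
sub pdf_to_html {
    my ($file) = @_;

    # List-form pipe open bypasses the shell, so spaces in file names are safe.
    open my $fh, '-|', pdftotext_command($file)
        or die "Cannot run pdftotext: $!\n";
    local $/;    # slurp mode: read the whole output at once
    my $html = <$fh>;
    close $fh or die "pdftotext failed on $file (exit status $?)\n";
    return $html;
}

# Only runs when a file name is given on the command line.
print pdf_to_html( $ARGV[0] ) if @ARGV;
```

Keeping the command in its own sub makes it easy to add further options later without touching the pipe handling.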
