in reply to Extracting content text from PDFs

Hi pat_mc,

If I were you, I would try CAM::PDF, using its getpdftext.pl script as a starting point. I have had great success with this method. I was recently given a couple of problematic PDFs for testing and it did have trouble with them, but since that was the first time it had failed me, I still suggest at least trying CAM::PDF.
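
For reference, here is a minimal sketch of running getpdftext.pl from a shell. The function name extract_text and the file names are placeholders made up for illustration. It assumes the script was installed along with CAM::PDF, is on your PATH, and writes the extracted text to standard output.

```shell
# extract_text is a hypothetical wrapper; pass it an input PDF and an output file.
extract_text() {
    src=$1
    out=$2
    # Refuse to run on a missing input rather than create an empty output file.
    if [ ! -f "$src" ]; then
        echo "no such file: $src" >&2
        return 1
    fi
    # Assumption: getpdftext.pl (shipped with CAM::PDF) is on PATH.
    if ! command -v getpdftext.pl >/dev/null 2>&1; then
        echo "getpdftext.pl not found; is CAM::PDF installed?" >&2
        return 127
    fi
    # Redirect the extracted text into the output file.
    getpdftext.pl "$src" > "$out"
}
```

Something like `extract_text input.pdf output.txt` would then be the whole experiment; if output.txt comes out empty or garbled, that is a hint the PDF is one of the problematic kind.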

If the PDFs in question are made up of images (I have seen PDFs where each page is actually an embedded TIFF produced by a scanning process), you will need to OCR each page. See Re: parse content of PDF file for further details.

You may want to read Extracting text from PDF. No really on this subject. Also remember that Super Search is your friend; this question seems to be asked frequently.

Hope this helps

Martin

Re^2: Extracting content text from PDFs
by pat_mc (Pilgrim) on Sep 12, 2008 at 14:07 UTC
    marto -

    Thanks for your extremely helpful post ... and apologies for not having responded to it earlier. My experience was exactly the one clinton describes in the thread you reference: modules like CAM::PDF only produce mildly helpful output. I am very grateful for the reference to the Linux tool pdftotext. With the option -htmlmeta it produces extremely useful, tagged output from a given PDF. This is precisely what I have been looking for for a long time. I will concentrate my efforts on this utility from now on.
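
A minimal sketch of such a call, for anyone following along. The function name pdf_to_tagged_html and the derived output name are illustrative assumptions; only pdftotext and its -htmlmeta option come from the discussion above.

```shell
# pdf_to_tagged_html is a hypothetical wrapper around pdftotext -htmlmeta.
pdf_to_tagged_html() {
    src=$1
    # Derive the output name by swapping a trailing .pdf for .html.
    out="${src%.pdf}.html"
    if [ ! -f "$src" ]; then
        echo "no such file: $src" >&2
        return 1
    fi
    # Assumption: pdftotext is installed and on PATH.
    if ! command -v pdftotext >/dev/null 2>&1; then
        echo "pdftotext not found" >&2
        return 127
    fi
    # -htmlmeta wraps the extracted text in simple HTML and adds meta headers.
    pdftotext -htmlmeta "$src" "$out"
}
```

Calling `pdf_to_tagged_html report.pdf` should leave report.html next to the original, assuming the tool is available.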

    Thanks again!

    Pat