Re: PDF Text

I've not used it, but will underscore the recommendation for swish-e, based on what I've heard about it.

But to answer your specific question, I use pdftotext to extract the ASCII text from a compliant PDF file. It's a command-line tool that is distributed with the xpdf reader application in many Linux distributions. It won't work on scanned images (for which that PDF::OCR sounds particularly interesting; I'll have to check that out, ++ and thanks!). But for folks who export editable documents to PDF, it works like a charm (though it is challenged a bit by multi-column content).
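For anyone who wants to try it, here is a minimal sketch of how pdftotext might be wrapped in a shell function. The function name pdf_text and the file name report.pdf are made up for illustration. Passing "-" as the output file sends the text to stdout, and -layout is an option that tries to preserve the physical page layout, which may or may not help with multi-column pages.

```shell
#!/bin/sh
# Hypothetical helper: print the text layer of a PDF to stdout.
# Usage: pdf_text FILE.pdf [extra pdftotext options...]
pdf_text() {
    if [ "$#" -lt 1 ]; then
        echo "usage: pdf_text FILE.pdf [pdftotext options]" >&2
        return 2
    fi
    pdf=$1
    shift
    if ! command -v pdftotext >/dev/null 2>&1; then
        echo "pdftotext not found (it ships with xpdf or poppler-utils)" >&2
        return 127
    fi
    # "-" as the output file writes the text to stdout instead of FILE.txt.
    pdftotext "$@" "$pdf" -
}

# Example calls (report.pdf is a placeholder file name):
#   pdf_text report.pdf > report.txt
#   pdf_text report.pdf -layout | grep -i invoice
```

Since the text goes to stdout, the output can be piped straight into grep or an indexer. Scanned PDFs without an embedded text layer will still come out empty or nearly so.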

-- Hugh

if( $lal && $lol ) { $life++; }


Replies are listed 'Best First'.

Re^2: PDF Text
by leocharre (Priest) on Jun 13, 2008 at 13:38 UTC

Something really interesting happened at my office.

We scan in a lot of documents. Now, the machines *are* able to embed OCR text in the PDF they create, which makes indexing the documents relatively easy.

BUT - guess what! They don't want to use the scanner's OCR, because they say it slows down scanning. For five pages, who cares? But for 200-page documents?

They have a point.

So I have my thing run at night, collecting info etc.
That's why I needed muscle.
