Text from PDF

colding has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
Re: Text from PDF by gellyfish (Monsignor) on Oct 26, 2004 at 16:34 UTC
You could use `ps2ascii`, a tool built on Ghostscript. Versions are available for both Windows and Unix. /J\
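For readers who want to drive `ps2ascii` from Perl, here is a minimal sketch rather than a tested recipe. It assumes `ps2ascii` is on the PATH, that it accepts a PDF file as its single argument and writes the text to standard output, and that the platform supports Perl's list-form pipe open (a Unix-like system). The helper names `ps2ascii_text` and `matching_lines` are made up for this illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Run ps2ascii on a PDF (or PostScript) file and return its output lines.
# Assumption: with one file argument, ps2ascii prints the text to stdout.
sub ps2ascii_text {
    my ($path) = @_;
    # List-form pipe open: the file name is never passed through a shell.
    open(my $fh, '-|', 'ps2ascii', $path)
        or die "Cannot run ps2ascii: $!";
    my @lines = <$fh>;
    close($fh) or die "ps2ascii failed on $path (exit status $?)";
    return @lines;
}

# Return the lines that match a pattern, prefixed with their line numbers.
sub matching_lines {
    my ($pattern, @lines) = @_;
    my @hits;
    for my $i (0 .. $#lines) {
        push @hits, sprintf("%d: %s", $i + 1, $lines[$i])
            if $lines[$i] =~ $pattern;
    }
    return @hits;
}

# Usage: perl script.pl file.pdf word
if (@ARGV == 2) {
    my ($pdf, $word) = @ARGV;
    print matching_lines(qr/\Q$word\E/i, ps2ascii_text($pdf));
}
```

Keeping the extraction and the matching in separate subs means the parsing half can be exercised on plain strings without Ghostscript installed.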
Re: Text from PDF by steves (Curate) on Oct 26, 2004 at 17:16 UTC
PDF::FDF::Simple claims to be able to extract some subset of text from PDF files to strings, although I have never personally used it. I'd be interested to hear how capable it is for this task if you decide to try it.
Re: Text from PDF by Popcorn Dave (Abbot) on Oct 26, 2004 at 16:28 UTC
This node may be of help to you. Adobe has an online utility that will turn a PDF into text, and you can parse it from there. Hope that helps!

Useless trivia: In the 2004 Las Vegas phone book there are approximately 28 pages of ads for massage, but almost 200 for lawyers.
Re: Text from PDF by saberworks (Curate) on Oct 26, 2004 at 16:57 UTC
If you don't need a pure-Perl solution, you can use the Linux utility pdftotext to extract all the text, then use Perl to parse it from there.
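The pdftotext-then-Perl approach might look something like the sketch below. It assumes `pdftotext` (from Xpdf or Poppler) is installed and on the PATH, and that it is run on a Unix-like system where Perl's list-form pipe open works. The sub names `pdf_to_text` and `split_pages` are invented for this example. Passing `-` as the output file asks pdftotext to write to standard output, and `-layout` tries to preserve the physical layout of the page.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Run pdftotext and return the extracted text as one string.
sub pdf_to_text {
    my ($pdf_path) = @_;
    # List-form pipe open: the file name is never passed through a shell.
    open(my $fh, '-|', 'pdftotext', '-layout', $pdf_path, '-')
        or die "Cannot run pdftotext: $!";
    local $/;    # slurp mode: read the whole output at once
    my $text = <$fh>;
    close($fh) or die "pdftotext failed on $pdf_path (exit status $?)";
    return defined $text ? $text : '';
}

# Split the text into pages. pdftotext normally ends each page with a
# form feed; pages that contain only whitespace are dropped.
sub split_pages {
    my ($text) = @_;
    return grep { /\S/ } split /\f/, $text;
}

# Usage: perl script.pl file.pdf
if (@ARGV) {
    my @pages = split_pages(pdf_to_text($ARGV[0]));
    printf "%d page(s) with text\n", scalar @pages;
    print "--- page 1 ---\n$pages[0]\n" if @pages;
}
```

From here, each element of the page list is an ordinary string, so the usual regex and `split` tools apply.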
Re: Text from PDF by punch_card_don (Curate) on Oct 26, 2004 at 17:36 UTC
IF, and this is a big 'if', your needs are for a limited number of documents and the real objective is just getting the text, for example for indexing (as opposed to developing a text extraction tool for long-term use), then a very low-tech solution might suffice:

- open the PDF in Acrobat (not Acrobat Reader, Acrobat)
- under 'View' select 'Continuous'
- under 'Edit' click 'Select All'
- copy & paste

Takes about 20 seconds per document. The math on time investment is easily done. I once used it instead of developing a module to extract PDF text, when indexing PDF files for an index-based search engine before direct PDF indexing was commonplace.
Re: Text from PDF by dragonchild (Archbishop) on Oct 27, 2004 at 13:04 UTC
PDF::Extract seems to be where you want to look.

Being right, does not endow the right to be rude; politeness costs nothing. Being unknowing, is not the same as being stupid. Expressing a contrary opinion, whether to the individual or the group, is more often a sign of deeper thought than of cantankerous belligerence. Do not mistake your goals as the only goals; your opinion as the only opinion; your confidence as correctness. Saying you know better is not the same as explaining you know better.
Re^2: Text from PDF by steves (Curate) on Oct 27, 2004 at 18:19 UTC
The docs for PDF::Extract seem to indicate that it just pulls pages out of an existing PDF document and creates new PDF documents out of these subsets.
Re: Text from PDF by steves (Curate) on Oct 27, 2004 at 10:08 UTC
I played around with PDF::FDF::Simple and I couldn't get it to extract text from PDF files. I thought that FDF was just a subset of PDF, but there must be more to it than that. Then I looked around for free PDF-to-text tools and was surprised to find that there aren't many that are truly free. Ghostscript may be your best free option; it apparently has a tool for getting text from PDF documents. Another one I found is a Java tool named PDFBox.
Re: Text from PDF by Anonymous Monk on Oct 26, 2004 at 21:56 UTC
Thanks for all the replies. For the record: (1) I've looked at other utils, but wanted a Perl solution; (2) I saw the FDF module and don't know that I want to bring it in, since I've never heard of FDFs; (3) I want a script to do what Adobe does (badly) by saving the PDF to text. Also for the record: I give up. I'm gonna buy somethin'.