Hi,
Do a Super Search on this topic; it has been covered before. pdftotext is a non-Perl way to extract text from PDF files, and you could call it from a Perl script. Also have a look at the modules on CPAN and see if any of them fit your requirements.
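For what it's worth, here is a minimal sketch of calling pdftotext from Perl. It assumes the pdftotext binary is installed and on your PATH. The sub names pdftotext_command and pdf_to_text are made up for illustration. The -layout switch and "-" as the output file (meaning stdout) are standard pdftotext options. The list form of open avoids the shell, so odd characters in a file name are not interpreted.

```perl
use strict;
use warnings;

# Hypothetical helper: build the argument list for pdftotext.
# -layout keeps the original physical layout; "-" sends the text to stdout.
sub pdftotext_command {
    my ($pdf_path) = @_;
    return ('pdftotext', '-layout', $pdf_path, '-');
}

# Hypothetical helper: run pdftotext and return the extracted text.
sub pdf_to_text {
    my ($pdf_path) = @_;
    my @cmd = pdftotext_command($pdf_path);

    # List-form pipe open: no shell is involved.
    open(my $fh, '-|', @cmd) or die "Cannot run $cmd[0]: $!";
    local $/;                      # slurp mode: read all output at once
    my $text = <$fh>;
    close($fh) or die "$cmd[0] exited with status $?";
    return $text;
}

# Print the text of the PDF named on the command line, if any.
print pdf_to_text($ARGV[0]) if @ARGV;
```

Run it as "perl script.pl somefile.pdf". If the PDF comes from a source you don't control, check the output before relying on it.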
Martin | [reply] |
With a little supersearching, you could have found that pdftotext is part of the xpdf package.
I've used it since late 2002, and the only problems I've had arise from one particular organization (a government agency, go figure) making changes that were not obvious, and doing odd, possibly nonstandard things with their formatting.
pdftotext works well, but you have to watch the source of your data, particularly if that source isn't trustworthy. Although, come to think of it, that's true in all areas of my job.
--
tbone1, YAPS (Yet Another Perl Schlub)
And remember, if he succeeds, so what.
- Chick McGee
| [reply] |
Hi,
If the purpose is to develop a search engine, use ht-Dig. Here is how to index PDFs. As an alternative, there is swish-e.
But how to do this with Perl? I don't know an efficient way.
| [reply] |
Nothing that I've tried myself, but the general hint is: search CPAN for something suitable - search results here. | [reply] |