in reply to pdf -> text
I used pdftotext (part of xpdf, http://www.foolabs.com/xpdf/) for a client's search engine. Yes, you have to spawn a process, but pdftotext is rather fast and works nicely. Since the search engine reindexes the site twice daily, I cache pdftotext's output in a text file and compare its timestamp to that of the PDF file, so most of the time I only have to slurp in the cached text file.
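For anyone who wants to try the same trick, here is a rough sketch of the caching scheme in Python. It is not the code from that site. The function names, the choice to store the cache next to the PDF as `<name>.pdf.txt`, and the output encoding are all assumptions of mine; the only thing taken from pdftotext itself is the basic `pdftotext PDF-file text-file` invocation.

```python
import os
import subprocess


def cache_is_fresh(pdf_path, txt_path):
    """Return True if the cached text exists and is at least as new as the PDF."""
    try:
        return os.path.getmtime(txt_path) >= os.path.getmtime(pdf_path)
    except OSError:
        # Missing cache file (or missing PDF) counts as "not fresh".
        return False


def pdf_text(pdf_path, converter="pdftotext", encoding="utf-8"):
    """Return the text of a PDF, re-running the converter only when needed.

    Assumptions: the cache lives beside the PDF, and the converter writes
    text in `encoding` (adjust this to match your pdftotext build).
    """
    txt_path = pdf_path + ".txt"
    if not cache_is_fresh(pdf_path, txt_path):
        # Spawn the external process: pdftotext PDF-file text-file
        subprocess.run([converter, pdf_path, txt_path], check=True)
    # Slurp the cached text; undecodable bytes are replaced, not fatal.
    with open(txt_path, encoding=encoding, errors="replace") as fh:
        return fh.read()
```

Calling `pdf_text("manual.pdf")` on each indexing run would then only spawn pdftotext for PDFs that changed since the last run; everything else is a plain file read.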