To collect words from PDF documents, you can use the ps2ascii utility that comes with Ghostscript. It runs the document through Ghostscript with a special output device that emits only ASCII text. Since Ghostscript can handle PDFs as well as PostScript, ps2ascii works fine on them (although I did hit compatibility problems with some PDFs, depending on the generating program and the version of Ghostscript).
This doesn't work for Word documents, of course.
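Here is a rough sketch of how that could look from the shell. The function names words_from_text and pdf_words are my own, and the PDF path is whatever you pass in; the only thing taken from Ghostscript is the ps2ascii command itself, which writes to standard output when given just an input file. Treat it as a starting point rather than a finished tool.

```shell
#!/bin/sh

# Split text on stdin into one lowercase word per line, duplicates removed.
# (Hypothetical helper; "word" here just means a run of letters.)
words_from_text() {
    tr -cs '[:alpha:]' '\n' | tr '[:upper:]' '[:lower:]' | sed '/^$/d' | sort -u
}

# Extract the text of a PDF (or PostScript) file with ps2ascii and
# collect its words. "$1" is the path to the document.
pdf_words() {
    if ! command -v ps2ascii >/dev/null 2>&1; then
        echo "ps2ascii not found (it ships with Ghostscript)" >&2
        return 1
    fi
    ps2ascii "$1" | words_from_text
}
```

You would then call something like `pdf_words document.pdf > words.txt`. If a particular PDF comes out garbled or empty, that is probably the compatibility problem mentioned above, and trying a different Ghostscript version may help.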
In reply to Re^3: Perl variant of linux tool strings by ambrus, in thread Perl variant of linux tool strings by jeanluca