in reply to Re^2: Perl variant of linux tool strings
in thread Perl variant of linux tool strings

For collecting words from PDF documents, you can use the ps2ascii utility that comes with Ghostscript. It runs the document through Ghostscript using a special device that outputs only ASCII text. Since Ghostscript can handle PDFs too, ps2ascii works fine on them (although I did hit compatibility problems with some PDFs, depending on the generating program and the version of Ghostscript).
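A minimal sketch of calling ps2ascii from Perl, assuming the utility is installed and on the PATH. The sub name pdf_to_text and its optional command argument are hypothetical names introduced for illustration only; the second argument exists so the wrapper can be tried with another command.

```perl
use strict;
use warnings;

# Run ps2ascii on a PDF or PostScript file and return the extracted text.
# Assumption: ps2ascii (shipped with Ghostscript) is on the PATH.
# The optional extra arguments replace the command (hypothetical convenience).
sub pdf_to_text {
    my ($path, @cmd) = @_;
    @cmd = ('ps2ascii') unless @cmd;

    # List-form pipe open: the filename is never interpreted by a shell.
    open(my $fh, '-|', @cmd, $path)
        or die "Cannot run $cmd[0]: $!";

    # Slurp everything the command writes to standard output.
    my $text = do { local $/; <$fh> };
    close($fh) or warn "$cmd[0] exited with status $?\n";

    return defined $text ? $text : '';
}
```

Called as pdf_to_text('report.pdf'), where report.pdf is a placeholder filename, it should return whatever text ps2ascii manages to extract. Given the compatibility problems mentioned above, it is worth checking for empty output.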

This doesn't work for Word documents, of course.


Re^4 perl variant of linux tool 'strings'
by Random_Walk (Prior) on Mar 23, 2005 at 21:25 UTC

    OP, you may have some luck loading the MS Word document into (star|open)office, printing it to PDF, then chucking that at ps2ascii. Since it is exactly the formatting that is hardest for *office to get right, and ASCII keeps little remnant of it, I guess you could have a lot of luck.

    update

    As ambrus points out below, of course, if you can read the Word doc into *office then you can just export ASCII from there. Sorry, it has been a rather long day.

    You may also want to trawl through a list of filters. I found this one, which looks like it may have some tools that could help.

    Cheers,
    R.

    Pereant, qui ante nos nostra dixerunt!

      If you can load a document into *office, why don't you save it straight as ASCII text, or at least in some other format that can be parsed easily?

        Thanks a lot Monks!!
        Just to make clear why I need all this: I wrote a CGI script that manages a database.
        It allows one to add information to it and retrieve it again (via a search form). Adding information goes via a form too, and the input might contain HTML, just like this one. So when you post something like:

        a href = my_word document.doc ... etc

        When a post links to a Word document, I would like to be able to read that document and generate the search keywords for that post.
        But my impression is that this is not so easy!!
        Luca
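Once a document has been turned into plain text by one of the routes suggested in this thread, building the search keywords is the easy part. A minimal sketch, assuming English text; the sub name keywords, the stop-word list and the minimum word length are all hypothetical choices, not anything taken from Luca's script.

```perl
use strict;
use warnings;

# Hypothetical stop-word list; extend it to suit the database.
my %stop = map { $_ => 1 } qw(the and for with that this from are was you);

# Return the distinct lowercase words of at least $min_len letters,
# most frequent first (ties broken alphabetically).
sub keywords {
    my ($text, $min_len) = @_;
    $min_len = 3 unless defined $min_len;

    my %count;
    for my $word (lc($text) =~ /[a-z]+/g) {
        next if length($word) < $min_len or $stop{$word};
        $count{$word}++;
    }

    return sort { $count{$b} <=> $count{$a} or $a cmp $b } keys %count;
}
```

The list it returns could be stored alongside the post for the search form to match against. Only plain a-z runs count as words here, so accented characters and digits would need a wider pattern.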