in reply to Re: Perl variant of linux tool strings
in thread Perl variant of linux tool strings

I'd like to collect words from a PDF or Word document. So far Perl Power Tools does a very good job. Thanks!

Re^3: Perl variant of linux tool strings
by ambrus (Abbot) on Mar 23, 2005 at 21:09 UTC

    For collecting words from PDF documents, you can use the ps2ascii utility, which comes with Ghostscript. It runs the document through Ghostscript using a special output device that emits only ASCII text. Since Ghostscript can handle PDFs too, ps2ascii works fine on them (although I did have compatibility problems with some PDFs, depending on the generating program and the version of Ghostscript).

    This doesn't work for Word documents, of course.
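
As a rough illustration of the ps2ascii approach, here is a minimal Perl sketch. It assumes ps2ascii is on your PATH and writes to standard output when given only an input file. The helper names count_words and pdf_to_text, and the pattern used to decide what counts as a word, are made up for this example.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Split plain text into lowercase words and count them.
# A "word" here is a run of ASCII letters, optionally with inner
# apostrophes (an assumption; adjust the pattern to taste).
sub count_words {
    my ($text) = @_;
    my %count;
    $count{ lc $1 }++ while $text =~ /([A-Za-z]+(?:'[A-Za-z]+)*)/g;
    return \%count;
}

# Run ps2ascii on a file and return everything it prints.
# The list form of open avoids passing the filename through a shell.
sub pdf_to_text {
    my ($file) = @_;
    open my $fh, '-|', 'ps2ascii', $file
        or die "Cannot run ps2ascii: $!";
    local $/;    # slurp mode: read the whole output at once
    my $text = <$fh>;
    close $fh or die "ps2ascii failed on $file (status $?)";
    return defined $text ? $text : '';
}

# Only do any work when a filename is given on the command line.
if (@ARGV) {
    my $count = count_words( pdf_to_text( $ARGV[0] ) );
    # Most frequent words first, ties broken alphabetically.
    for my $word ( sort { $count->{$b} <=> $count->{$a} || $a cmp $b }
                   keys %$count ) {
        printf "%6d %s\n", $count->{$word}, $word;
    }
}
```

Saved as, say, pdfwords.pl, it would be run as `perl pdfwords.pl document.pdf`. Keeping the word counting separate from the ps2ascii call means the same count_words routine can be fed text from any other converter.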

      OP, you may have some luck loading the MS Word document into (Star|Open)Office, printing to PDF, then chucking it at ps2ascii. Since formatting is exactly what *Office finds hardest to get right, and ASCII retains little remnant of it, I guess you could have a lot of luck.

      update

      As ambrus points out below, of course, if you can read the Word doc into *Office then you can just export ASCII from there. Sorry, it has been a rather long day.

      You may also want to trawl through a list of filters. I found this one, which looks like it may have some tools that could help.

      Cheers,
      R.

      Pereant, qui ante nos nostra dixerunt!

        If you can load a document into *Office, why not save it straight as ASCII text, or at least in some other format that can be parsed easily?
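
For what it's worth, saving straight to text can also be scripted. The sketch below assumes a current LibreOffice or OpenOffice install whose soffice binary is on your PATH and accepts the --headless, --convert-to and --outdir options; older *Office releases may need the GUI or a macro instead. The helper names convert_command and text_path are made up for this example.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Basename qw(fileparse);
use File::Spec;

# Build the argument list for a headless conversion to plain text.
# "txt:Text" names the target extension and the export filter.
sub convert_command {
    my ($doc, $outdir) = @_;
    return ( 'soffice', '--headless', '--convert-to', 'txt:Text',
             '--outdir', $outdir, $doc );
}

# Where the converted file is expected to land: same base name with a
# .txt extension, inside the output directory.
sub text_path {
    my ($doc, $outdir) = @_;
    my ($base) = fileparse( $doc, qr/\.[^.]*/ );
    return File::Spec->catfile( $outdir, "$base.txt" );
}

# Only do any work when a document is given on the command line.
if (@ARGV) {
    my ( $doc, $outdir ) = ( $ARGV[0], $ARGV[1] // '.' );
    # The list form of system bypasses the shell, so odd filenames are safe.
    system( convert_command( $doc, $outdir ) ) == 0
        or die "soffice failed (status $?)";
    print "Text written to ", text_path( $doc, $outdir ), "\n";
}
```

Run it as `perl doc2txt.pl report.doc outdir` (the script name is arbitrary). The resulting .txt file can then be split into words with whatever tool you were already using.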