in reply to parsing word on non-windows platform

As discussed before, a "good" way to start is by running your MS Word files through antiword. That produces a plain-text version, which you can then parse.
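A minimal sketch of that first step, in Python for brevity (the same shape works from Perl with backticks or open). It assumes antiword is on your PATH and that a bare `antiword file.doc` prints the text to stdout; the helper names (`antiword_command`, `doc_to_text`, `paragraphs`) are made up for illustration, and the paragraph splitter assumes blank lines separate paragraphs in the output.

```python
import subprocess


def antiword_command(doc_path, antiword="antiword"):
    """Build the argv that dumps a Word file as plain text on stdout."""
    return [antiword, str(doc_path)]


def doc_to_text(doc_path, run=subprocess.run):
    """Return the plain text antiword produces for doc_path.

    `run` is injectable so the function can be exercised without
    antiword installed.
    """
    result = run(
        antiword_command(doc_path),
        capture_output=True,
        text=True,
        check=True,  # raise if antiword exits with an error
    )
    return result.stdout


def paragraphs(text):
    """Split plain text into paragraphs, treating blank lines as breaks."""
    blocks = []
    current = []
    for line in text.splitlines():
        if line.strip():
            current.append(line.strip())
        elif current:
            blocks.append(" ".join(current))
            current = []
    if current:
        blocks.append(" ".join(current))
    return blocks
```

From there, `paragraphs(doc_to_text("report.doc"))` gives you a list of strings to run your own parsing over.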

HTH

--
b10m

m$ word -> ps -> pdf
by g00n (Hermit) on Jan 26, 2004 at 01:50 UTC
    Available for many platforms (GNU, source and binary): "...Antiword is able to convert Word documents to plain text, to PostScript and to XML/DocBook..."

    Thanks, this is just the tool I need. It is a shorter path than doc->html->ps->pdf. While the tool has no Perl bindings (to my knowledge), I can just run a Perl (or Python) script from cron to convert a directory of doc files to pdf via ps.
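    Roughly what I have in mind for the cron job, as a Python sketch. Assumptions: antiword emits PostScript on stdout when given `-p <papersize>`, Ghostscript's `ps2pdf` is installed for the ps->pdf step, and the function names and the a4 default are placeholders of mine, not anything the tools dictate.

```python
import subprocess
from pathlib import Path


def plan_conversions(doc_dir, out_dir):
    """Pair every .doc file in doc_dir with its .ps and .pdf targets."""
    out = Path(out_dir)
    return [
        (doc, out / (doc.stem + ".ps"), out / (doc.stem + ".pdf"))
        for doc in sorted(Path(doc_dir).glob("*.doc"))
    ]


def convert(doc, ps, pdf, paper="a4"):
    """Run doc -> ps with antiword, then ps -> pdf with ps2pdf."""
    with open(ps, "wb") as handle:
        # antiword writes PostScript to stdout when -p <papersize> is given
        subprocess.run(
            ["antiword", "-p", paper, str(doc)], stdout=handle, check=True
        )
    subprocess.run(["ps2pdf", str(ps), str(pdf)], check=True)


def convert_directory(doc_dir, out_dir):
    """Convert every new or changed .doc in doc_dir to a PDF in out_dir."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    for doc, ps, pdf in plan_conversions(doc_dir, out_dir):
        if pdf.exists() and pdf.stat().st_mtime >= doc.stat().st_mtime:
            continue  # already converted; keeps repeated cron runs cheap
        convert(doc, ps, pdf)
```

    Calling `convert_directory()` with the source and output directories from a small script is then all the cron entry has to do.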

    Update: matts (AxKit) just left a message on my use.perl journal recommending HTMLDOC for converting HTML to PDF. It also has an AxKit plugin. I checked out the man page and the FAQ: GNU licensing, Perl bindings and end-user support are available.

Re: Re: parsing word on non-windows platform
by MCS (Monk) on Jan 26, 2004 at 17:17 UTC

    Thanks, this seems to be just what I needed. Now if only I knew how to make a perl module to wrap it.

      Take a look at SWISH::Filter, part of the incredible search engine SWISH-E.

      From the page above:

      SWISH::Filter provides a unified way to convert documents into a type that swish-e can index.

      Hope this helps,
      mobiGeek.