MCS has asked for the wisdom of the Perl Monks concerning the following question:

I'm writing a little program to parse recipe files for personal use and then put the results in a database. I have it working on regular text files and now I would like to conquer the recipes my uncle typed up in Word. Unfortunately, I don't have a Windows machine to use the Win32::OLE modules.

Is there a way to get the text from a Word file in a UN*X environment in Perl? (I have both Linux and OS X.) I searched CPAN but seemed to come up only with Windows solutions.

Replies are listed 'Best First'.
Re: parsing word on non-windows platform
by b10m (Vicar) on Jan 25, 2004 at 22:00 UTC

    As discussed before, a "good" way to start is by running your MS Word files through antiword. That will create a plaintext version, which you might parse.

    HTH

    --
    b10m
      Antiword is available for many platforms, GNU-licensed, as source and binary: "... Antiword is able to convert Word documents to plain text, to PostScript and to XML/DocBook ..."

      Thanks, this is just the tool I need. It is a shorter path than doc->html->ps->pdf. While the tool has no Perl bindings (to my knowledge), I can just cron a Perl (or Python) script to convert a directory of doc files to PDF via PostScript.
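A rough sketch of that cron job, assuming `antiword` and Ghostscript's `ps2pdf` are both on the PATH. The sub names (`output_names`, `doc_to_pdf`, `convert_dir`) and the A4 paper size are placeholders of mine, not anything that ships with antiword:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Derive the intermediate .ps and final .pdf names from a .doc path.
sub output_names {
    my ($doc) = @_;
    (my $base = $doc) =~ s/\.doc\z//i;
    return ("$base.ps", "$base.pdf");
}

# Convert one Word file to PDF by way of PostScript.
sub doc_to_pdf {
    my ($doc) = @_;
    my ($ps, $pdf) = output_names($doc);

    # "antiword -p <paper size>" writes PostScript to stdout.
    # The list form of open bypasses the shell, so spaces in names are fine.
    open my $in, '-|', 'antiword', '-p', 'a4', $doc
        or die "Cannot run antiword: $!\n";
    open my $out, '>', $ps
        or die "Cannot write $ps: $!\n";
    print {$out} $_ while <$in>;
    close $in  or die "antiword failed on $doc (status $?)\n";
    close $out or die "Cannot close $ps: $!\n";

    system('ps2pdf', $ps, $pdf) == 0
        or die "ps2pdf failed on $ps (status $?)\n";
    unlink $ps;    # drop the intermediate file
    return $pdf;
}

# Convert every *.doc file directly inside $dir (no recursion).
sub convert_dir {
    my ($dir) = @_;
    opendir my $dh, $dir or die "Cannot open $dir: $!\n";
    my @docs = sort grep { /\.doc\z/i } readdir $dh;
    closedir $dh;
    return map { doc_to_pdf("$dir/$_") } @docs;
}

# Only do any work when a directory is given on the command line.
if (@ARGV) {
    print "$_\n" for convert_dir($ARGV[0]);
}
```

Run from cron with the directory as its only argument. As written it stops at the first file that fails, which may or may not be what you want for an unattended job.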

      Update: matts (AxKit) just left a message on my use.perl journal recommending HTMLDOC for converting HTML to PDF. It also has an AxKit plugin. I checked out the man page and FAQ: GNU licensing, Perl bindings and end-user support are available.

      Thanks, this seems to be just what I needed. Now if only I knew how to make a Perl module to wrap it.
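For what it's worth, a wrapper does not need to be much. Here is a minimal sketch that simply runs the `antiword` binary and captures its output. The package name `Local::Antiword`, the sub `doc_to_text` and the `$ANTIWORD` variable are all made up for illustration; nothing by that name exists on CPAN as far as this sketch is concerned:

```perl
package Local::Antiword;    # hypothetical name, for illustration only
use strict;
use warnings;

# Name or full path of the converter binary. Kept in a package
# variable so it can be pointed somewhere else if needed.
our $ANTIWORD = 'antiword';

# Return the plain text of a Word file, or die on failure.
sub doc_to_text {
    my ($path) = @_;
    die "No such file: $path\n" unless -f $path;

    # List-form pipe open: no shell involved, so odd file names are safe.
    open my $fh, '-|', $ANTIWORD, $path
        or die "Cannot run $ANTIWORD: $!\n";
    local $/;                          # slurp the whole output at once
    my $text = <$fh>;
    $text = '' unless defined $text;   # converter printed nothing
    close $fh
        or die "$ANTIWORD failed on $path (status $?)\n";
    return $text;
}

1;
```

Saved as Local/Antiword.pm somewhere in @INC, it would be called as `my $text = Local::Antiword::doc_to_text('pie.doc');` and the result handed to the existing text parser.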

        Take a look at SWISH::Filter, part of the incredible search engine SWISH-E.

        From the page above:

        SWISH::Filter provides a unified way to convert documents into a type that swish-e can index.

        Hope this helps,
        mobiGeek.

Re: parsing word on non-windows platform
by neuroball (Pilgrim) on Jan 25, 2004 at 21:58 UTC

    Hi MCS, we already have a node where we discussed the conversion from Word to PDF files. You might want to take a look at it and just replace 'PDF' with 'Textfile'. I.e. the OpenOffice solutions should be what you are searching for.

    /oliver/

Re: parsing word on non-windows platform
by dominix (Deacon) on Jan 25, 2004 at 22:15 UTC
    There is a library, wvWare, that does this on *nix; you may want to make a Perl module that wraps it. :-)
    --
    dominix
Re: parsing word on non-windows platform
by punchcard_don (Beadle) on Jan 26, 2004 at 14:45 UTC
    Sometimes low-tech is just as effective and a lot simpler.

    Just open the Word files in Word, hit "Save as...", and opt to save as .txt.

    Reload the files on the server and parse them.

      Yes, but when you have many Word documents, this starts to become tedious. "Laziness, Impatience and Hubris" make a good programmer ;-)

        Had (apparently mistakenly) imagined that "recipes my uncle typed up" meant a reasonable number of documents. How many recipes can one uncle type? At ~20 seconds per document, open-save-close, that's 180 recipes in an hour, far less than the time to develop, test and implement a server-side solution. Probably less than the time to post a question and read silly replies! :-)
Re: parsing word on non-windows platform
by Willard B. Trophy (Hermit) on Jan 26, 2004 at 14:35 UTC
    For very quick parsing, this usually works for me:
      strings file.doc | fmt -s
    
    Yes, you do get some extra guff, but if you are looking for particular words, this will pick them out.
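If you would rather stay inside Perl, the same trick can be approximated with one regex. This is only a sketch of the idea, not a reimplementation of strings(1): the sub name `printable_runs` and the default minimum length of 4 are my own choices, it does none of the reflowing that `fmt -s` does, and it only looks for single-byte text, so documents stored as 16-bit Unicode would need a different pattern.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Return every run of at least $min printable ASCII characters
# (space through tilde, plus tab) found in $data.
sub printable_runs {
    my ($data, $min) = @_;
    $min = 4 unless defined $min;
    return $data =~ /([\x20-\x7E\t]{$min,})/g;
}

# With a file name on the command line, print one run per line.
if (@ARGV) {
    open my $fh, '<:raw', $ARGV[0]
        or die "Cannot read $ARGV[0]: $!\n";
    local $/;                          # slurp the whole file
    my $data = <$fh>;
    $data = '' unless defined $data;   # empty file
    close $fh;
    print "$_\n" for printable_runs($data);
}
```

The same extra guff shows up here too, so it is best suited to grepping for particular words rather than recovering a clean recipe.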

    --
    bowling trophy thieves, die!