iguane has asked for the wisdom of the Perl Monks concerning the following question:

I am looking to write, or build on, a script that can convert an MS Word file to a text file (or an HTML file). Layout is not really a problem for me. It must all be in Perl and run on Linux (without any connection to Windows).
Could someone help me?

Replies are listed 'Best First'.
Re: MSWORD TO TEXT
by ChOas (Curate) on Jan 31, 2001 at 15:10 UTC
    Hey!

    I found this for ya:

    LAOLA
    is a collection of documentation and Perl programs dealing with the binary file formats of Windows program documents. LAOLA gives access to the raw document streams of any program that uses "structured storage" technology to save its documents. ELSER deals specifically with these streams as they are present in Word 6 and Word 7 documents.

    You can find it here

    GreetZ!,
      ChOas

    print "profeth still\n" if /bird|devil/;
      Unfortunately it seems to require Perl 4, and only supports up to Word 7. There is a repository of file format details at www.wotsit.org, and from there I once upon a time found my way to an in-progress open source Windows document converter. But I didn't stay long...

      ____________________
      Jeremy
      I didn't believe in evil until I dated it.

        Hey!
        Completely right, but it also mentions its successor OLE::Storage
        (available at your local CPAN), which uses Perl5, and does more...

        GreetZ!,
          ChOas

        print "profeth still\n" if /bird|devil/;
Re: MSWORD TO TEXT
by Trinary (Pilgrim) on Jan 31, 2001 at 21:13 UTC
    Did a couple of searches, came up with this...the consensus seems to be that there aren't any Word .doc parsers, and your only hope is to use Win32::OLE, which apparently doesn't suit your needs. OLE::Storage just might do the trick...check it out.

    I could've sworn there was a command-line utility to do this conversion that would work, but I'm unable to find it right now...will search around some more.

    Trinary

      Check out the wv library; a rather nifty suite of command-line Word conversion tools.

          --k.


        For this part, I just wrote some code to convert Word to text using Win32::OLE. The limitation is that it has to run on a Windows machine, but it works correctly.
      I could've sworn there was a command-line utility to do this conversion that would work, but I'm unable to find it right now...will search around some more.

      Here at work we have word2x installed, which works on Word 6 documents (according to the man page), but since I have no Word documents at all I can't test it. Maybe have a look at http://word2x.alcom.co.uk. The README points to http://www.kfa-juelich.de/isr/1/texconv.html "for a list of other converters". As one can guess from the URL (kfa-juelich), it is a TeX site ...

      I could swear I once ran across a word-to-ascii-converter, but can't remember name or place, sorry.

      Regards
      Stefan K

      $dom = "skamphausen.de"; ## May The Open Source Be With You!
      $Mail = "mail\@$dom";
      $Url = "http://www.$dom";
Re: (Zigster) MSWORD TO TEXT
by zigster (Hermit) on Apr 12, 2001 at 19:11 UTC
    I use the UNIX command 'strings'; it works fine and dandy with most Word docs I have come across. The output is a little rough, but in most cases I can read the document. It all depends on how clean you want the output.
    --

    Zigster
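
For anyone who would rather do the same rough extraction inside Perl, here is a minimal sketch of roughly what strings does by default: it pulls every run of four or more printable ASCII characters out of the raw bytes. The name printable_runs is a hypothetical helper, not part of any module, and the script name in the usage comment is made up. Like plain strings, it only sees 8-bit text, so anything Word stored as 16-bit characters may be missed (GNU strings has -e l for that case).

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: return every run of at least $min printable
# ASCII characters (plus tabs) found in a binary buffer, similar in
# spirit to strings(1). Not a real Word parser.
sub printable_runs {
    my ($data, $min) = @_;
    $min = 4 unless defined $min;    # same default length as strings
    my @runs;
    while ($data =~ /([\x20-\x7E\t]{$min,})/g) {
        push @runs, $1;
    }
    return @runs;
}

# Usage (script name is an assumption): perl doc2txt.pl file.doc
if (@ARGV) {
    my $file = shift @ARGV;
    open my $fh, '<:raw', $file or die "Cannot open $file: $!";
    local $/;                        # slurp the whole file at once
    my $data = <$fh>;
    close $fh;
    print "$_\n" for printable_runs($data);
}
```

The output is as rough as that of strings itself: font names, field codes and other metadata come out mixed in with the document text.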

      Zigster,
      All I can say about strings is WOW! That works perfectly on Word 2k, WordPerfect 8, and Excel 2k files. Combined with pdftotext you have a nearly complete solution for extracting text from common user docs, which I'm doing for a search engine for a web-based document management site. Just goes to show that if there's something you want to do on Unix/Linux, chances are the tool is already sitting on your hard drive.

        Glad to know it worked for you. I would be very interested in seeing the result when you have completed it. As a full-on UNIX head working in an MS world, a complete toolset for converting MS docs to ASCII would be of great interest to me. Please msg me when/if you complete the tools.

        Cheers
        --

        Zigster

Re: MSWORD TO TEXT
by buckaduck (Chaplain) on Apr 12, 2001 at 19:04 UTC
    I'm pretty happy with the freeware program catdoc. It doesn't handle anything fancy like OLE objects, but it does a fine job extracting plain text from a plain Word document, including the Office97 format.

    If nothing else, the link above will point you to other good resources.

    buckaduck
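
Since the original question asked for Perl, here is a minimal sketch of driving a command-line converter such as catdoc from a script, assuming catdoc is installed and on the PATH. It relies only on catdoc writing the extracted text to standard output when given a file name. The name capture_output is a hypothetical helper and the script name in the usage comment is made up.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: run an external command and return whatever it
# writes to standard output. The list form of open bypasses the shell,
# so file names with spaces or quotes are passed through unchanged.
sub capture_output {
    my @cmd = @_;
    open my $pipe, '-|', @cmd or die "Cannot run $cmd[0]: $!";
    local $/;                        # slurp all output at once
    my $text = <$pipe>;
    close $pipe or die "$cmd[0] failed (exit status $?)";
    return defined $text ? $text : '';
}

# Usage (script name is an assumption): perl doc2txt.pl file.doc
# Assumes a catdoc binary is available on the PATH.
if (@ARGV) {
    print capture_output('catdoc', $ARGV[0]);
}
```

Because the helper takes the whole command as a list, swapping in another converter mentioned in this thread should only mean changing the arguments.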