It seems you are getting what looks like junk because you are reading a binary file as if it were plain text. A Word file is a stream of objects that need to be decoded. It also involves things like endianness, hierarchy, and possibly old versions of the text all jumbled together. But all is not lost!
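To make the "binary, not text" point concrete, here is a minimal sketch (in Python rather than Perl, purely for illustration) that checks whether some bytes start like an OLE2 compound file, the container format used by Word 97/2000 documents. The function name `describe_ole2_header` is my own invention, and the sketch only peeks at the first few header fields; it does not decode any text.

```python
import struct

# First 8 bytes of every OLE2 compound file (the container behind .doc files).
OLE2_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")


def describe_ole2_header(data: bytes):
    """Return basic header facts if data starts like an OLE2 file, else None.

    Hypothetical helper for illustration only.
    """
    if len(data) < 32 or data[:8] != OLE2_MAGIC:
        return None
    # Bytes 24..31 hold four little-endian 16-bit fields:
    # minor version, major version, byte-order mark, sector shift.
    minor, major, byte_order, sector_shift = struct.unpack_from("<HHHH", data, 24)
    return {
        "major_version": major,
        "little_endian": byte_order == 0xFFFE,
        "sector_size": 1 << sector_shift,
    }
```

To try it on an uploaded file, read the first 512 bytes in binary mode (for example `open(path, "rb").read(512)`) and pass them in. A `None` result means the file is not an OLE2 container at all, so a Word converter is unlikely to help.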

Here are some results from my own research in the recent past. Note: I do not have experience with any of these, or with word2x (which also looks good). StarOffice might work well too, especially if you have a massive amount of memory, or if you can use just the converter part.

Why not try wvWare (http://www.wvWare.com/)? It should support Word 2000 (Word version 9) files as well, and it is used in KDE among other places. It comes with scripts that automate various conversions, such as Word to HTML 4.0. I have not used it myself yet, but it sounds good and is free (GPL) software.
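If you go the wvWare route, the conversion could be driven from a script along these lines. This is an untested sketch, again in Python for illustration: it assumes the `wvHtml` script from the wv package is on your PATH and accepts an input and an output filename as its two arguments. Both function names are hypothetical, so check the wv documentation for the exact options on your system.

```python
import shutil
import subprocess


def build_wvhtml_command(doc_path: str, html_path: str) -> list:
    """Build the argument list for wvHtml (assumed form: wvHtml in.doc out.html)."""
    return ["wvHtml", doc_path, html_path]


def convert_doc_to_html(doc_path: str, html_path: str) -> bool:
    """Run wvHtml if it is installed; return True only on a zero exit status."""
    if shutil.which("wvHtml") is None:
        # Converter not found on PATH, so there is nothing to run.
        return False
    result = subprocess.run(build_wvhtml_command(doc_path, html_path))
    return result.returncode == 0
```

Passing the arguments as a list rather than one shell string keeps uploaded filenames from being interpreted by a shell, which matters when the path comes from a CGI upload.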

There are some other pieces of software which you may want to look at in the future, but probably not by Sunday.

lv, a multilingual file viewer (in case you are dealing with non-English encoding, you could drop your file on this after running it through wvWare).
http://www.ff.iij4u.or.jp/~nrt/lv/

If you want to access OLE streams (overkill I expect):

olemsword.pl, a program which uses Win32::OLE to read MS Word files (at least up to version 8, I believe). Available in association with www.namazu.org, a Japanese search engine. Win32::OLE at CPAN is based on ActiveState Perl's Windows library.
However, the Namazu people mainly use wvWare, saying olemsword.pl is only for the Windows platform (possibly not so any more).

Filters Web (http://arturo.directmail.org/filtersweb/), OLE libraries.

OLE::Storage (the Perl 4 version was called LAOLA).
Its "lhalw" utility doesn't fully support Word 8.
I have used the related Excel converter, which is not perfect.
Available on CPAN, or see http://user.cs.tu-berlin.de/~schwartz/perl/

Good luck and please tell us how it goes.

Matt


In reply to Re: uploading files in CGI by mattr
in thread uploading files in CGI by Anonymous Monk
