in reply to Dealing with Word Compact HTML

One other alternative: get OpenOffice, open your Word documents with it, and save them as HTML files. OpenOffice seems to do a much better job of keeping the HTML free of junk tags. I believe OpenOffice is also much easier to automate, and it has a powerful and consistent API (I haven't tried it, but that's my impression from the docs), so you could do all of this automatically from your Perl program.
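For what it's worth, here is a rough sketch of how the automation might look from Perl. It does not use the OpenOffice API at all; it just shells out to the office binary. I'm assuming the binary is called `soffice` and accepts `--headless`, `--convert-to` and `--outdir`, which is true of LibreOffice but may not hold for older OpenOffice releases (those may need a macro or the UNO API instead). The sub names are made up for illustration.

```perl
use strict;
use warnings;
use File::Basename qw(fileparse);
use File::Spec;

# Build the argument list for a headless conversion of one document to HTML.
# ASSUMPTION: the binary is "soffice" and understands --headless,
# --convert-to and --outdir (LibreOffice does; older OpenOffice may not).
sub build_convert_command {
    my ($doc, $outdir, $binary) = @_;
    $binary = 'soffice' unless defined $binary;
    return ($binary, '--headless', '--convert-to', 'html',
            '--outdir', $outdir, $doc);
}

# Predict where the converted file should land:
# same base name as the input, with an .html suffix, inside $outdir.
sub expected_html_path {
    my ($doc, $outdir) = @_;
    my ($name) = fileparse($doc, qr/\.[^.]*/);
    return File::Spec->catfile($outdir, "$name.html");
}

# Convert one document. Returns the HTML path on success, undef otherwise.
sub convert_to_html {
    my ($doc, $outdir) = @_;
    my @cmd = build_convert_command($doc, $outdir);
    # List-form system() bypasses the shell, so spaces in file names are safe.
    system(@cmd) == 0 or return;
    my $html = expected_html_path($doc, $outdir);
    return -e $html ? $html : undef;
}
```

You would call it as `my $html = convert_to_html('report.doc', 'out') or die "conversion failed";` and then feed the resulting file to whatever HTML parser you prefer.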

If you do decide to install it, keep in mind that OpenOffice will try to take over the file associations for your office documents, and it is quite a pain to get them back to their original state.

Re: Re: Dealing with Word Compact HTML
by qq (Hermit) on Apr 14, 2004 at 19:21 UTC

    OpenOffice puts much less cruft in the HTML, but the result still isn't very good (it isn't XHTML, for starters).

    See this guide for details, and a solution.

    qq

      That is probably true. However, I think it would still be much easier to extract the data from OpenOffice's HTML than from the HTML that Microsoft Office produces, which was the original problem.