astroboy has asked for the wisdom of the Perl Monks concerning the following question:

Hi All, I need to provide search facilities for web sites, from the sites themselves. I have found a number of inverted index modules on CPAN that can do this, but although they’ll index text and HTML, they don’t handle MS Word files, which is a requirement for me. So I figure that if I can strip the text out of the proprietary formats, I’ll have a viable solution. Unfortunately, because the sites will invariably be hosted on UNIX, I won’t be able to use Windows-specific Perl modules. Any ideas on alternatives?

Replies are listed 'Best First'.
Re: Inverted indexes and MS Word
by bart (Canon) on Dec 13, 2003 at 13:43 UTC
    You might try to use OLE::Storage/LAOLA to get at only the text. In particular, take a closer look at the lhalw script. Results aren't guaranteed, but for an inverted index it might just be good enough.
Re: Inverted indexes and MS Word
by thpfft (Chaplain) on Dec 13, 2003 at 18:48 UTC

    I think antiword might be what you need.
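
To make the antiword suggestion concrete, here is a minimal sketch rather than a tested solution. It assumes the antiword binary is installed and on the PATH, and that the files are binary .doc documents. The helper names extract_text and index_text are made up for illustration, and the hash-of-hashes index is only a stand-in for whichever CPAN inverted index module ends up being used.

```perl
use strict;
use warnings;

# Run an external text extractor on one file and return whatever it prints.
# Assumption: the default command, antiword, is installed and on the PATH.
sub extract_text {
    my ($doc_path, @cmd) = @_;
    @cmd = ('antiword') unless @cmd;
    # List-form pipe open bypasses the shell, so odd filenames are safe.
    open(my $fh, '-|', @cmd, $doc_path)
        or die "Cannot run $cmd[0]: $!";
    local $/;    # slurp the whole output at once
    my $text = <$fh>;
    close($fh) or warn "$cmd[0] exited with status $?\n";
    return defined $text ? $text : '';
}

# Add the words of one document to a toy inverted index:
# word => { doc_id => number of occurrences }
sub index_text {
    my ($index, $doc_id, $text) = @_;
    for my $word (lc($text) =~ /\w+/g) {
        $index->{$word}{$doc_id}++;
    }
    return $index;
}

# Usage sketch: index every Word file named on the command line
# and list which files each word appears in.
if (!caller() && @ARGV) {
    my %index;
    index_text(\%index, $_, extract_text($_)) for @ARGV;
    for my $word (sort keys %index) {
        print "$word: ", join(', ', sort keys %{ $index{$word} }), "\n";
    }
}
```

Since the extracted text is just a string, it could equally be handed to an existing indexer instead of index_text. The extractor command is passed as a parameter so that another converter can be swapped in without touching the indexing code.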

Re: Inverted indexes and MS Word
by Anonymous Monk on Dec 13, 2003 at 03:10 UTC
    Yes. This question came up about 20 times in the past two weeks, so please use the search feature of this site, or super search to find the wisdom. Cheers