I would look at swish-e. It's an extremely fast and flexible tool for indexing and searching various kinds of documents (HTML, XML, text, PDF, DOC, etc.) and has a nice Perl interface.
-- More people are killed every year by pigs than by sharks, which shows you how good we are at evaluating risk. -- Bruce Schneier