I did this recently for a site that uses the Template Toolkit. I used HTML::TokeParser to parse the HTML and grab the text to index, and a combination of MLDBM and Storable to build the search index. You can also use this or HTML::LinkExtor to extract the links from your pages. Just don't parse them out by hand (please).
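Roughly, the text and link extraction looks like this (just a sketch rather than my actual code; page.html is a stand-in for whatever you are indexing):

    use strict;
    use warnings;
    use HTML::TokeParser;
    use HTML::LinkExtor;

    my $file = 'page.html';

    # Grab the visible text tokens to index (in practice you may also
    # want to skip script/style contents and decode entities).
    my $p = HTML::TokeParser->new($file) or die "Can't open $file: $!";
    my @words;
    while (my $token = $p->get_token) {
        next unless $token->[0] eq 'T';          # text tokens only
        push @words, split ' ', $token->[1];
    }

    # Pull the links out with HTML::LinkExtor rather than by hand.
    my @links;
    my $ex = HTML::LinkExtor->new(sub {
        my ($tag, %attr) = @_;
        push @links, $attr{href} if $tag eq 'a' && defined $attr{href};
    });
    $ex->parse_file($file);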
I used stemming to reduce the size of the content and saved the relevant info about a page (the meta info) into a separate DBM cache. I did this so I could display the results with some of the page meta info (though on reflection this may not have been necessary).
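The meta cache is just a tied hash; something along these lines (the file and field names are only examples):

    use strict;
    use warnings;
    use Fcntl;
    use MLDBM qw(DB_File Storable);   # nested structures serialised with Storable

    # Tie a hash to an on-disk DBM file so the page meta info persists.
    tie my %meta, 'MLDBM', 'meta.db', O_CREAT|O_RDWR, 0644
        or die "Can't tie meta.db: $!";

    # Keyed by URL; store whatever you want to show on the results page.
    $meta{'/docs/trains.html'} = {
        title    => 'All about trains',
        summary  => 'First couple of sentences from the page...',
        modified => time,
    };

One MLDBM gotcha: you can't modify a nested value in place; copy the record out, change it, then assign it back.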
If you don't already know, stemming reduces words to a shorter form based on a set of rules. Thus trains, trainers and training could all be considered to be 'train'. I used Paice/Husk stemming, although you could consider using Lingua::Stem, which uses Porter's algorithm.
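Lingua::Stem is easy to drop in if you go the Porter route; a minimal example:

    use strict;
    use warnings;
    use Lingua::Stem qw(stem);

    # Stem the words before they go into the index. Porter maps e.g.
    # 'trains' and 'training' to 'train' ('trainers' stays closer to
    # 'trainer'; Paice/Husk is more aggressive).
    my @words   = qw(trains trainers training);
    my $stemmed = stem(@words);    # returns an array ref of stems
    print "@$stemmed\n";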
The search index used weighted, term-based indexing.
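To show the shape of what I mean, here is a toy weighted index (stem => { url => weight }); the weights are made up purely for illustration:

    use strict;
    use warnings;

    my %index;   # stem => { url => weight }

    # Weight title hits more heavily than body hits (illustrative only).
    sub add_document {
        my ($url, $title_stems, $body_stems) = @_;
        $index{$_}{$url} += 2 for @$title_stems;
        $index{$_}{$url} += 1 for @$body_stems;
    }

    # Score each page by summing the weights of the query stems it
    # contains, then return URLs best-first.
    sub search_index {
        my @query_stems = @_;
        my %score;
        for my $stem (@query_stems) {
            my $postings = $index{$stem} or next;
            $score{$_} += $postings->{$_} for keys %$postings;
        }
        return sort { $score{$b} <=> $score{$a} } keys %score;
    }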
If you would rather just get a pre-built one, then go to Perlfect, as that's pretty good. I didn't use it because it didn't do stemming (that I could see) and it didn't do phrase searching, which mine does.
I mention all of this because your search index is not at all optimised and some pre-processing would speed things up.
I would recommend this even though you want to fetch the pages via a robot.
HTH
Simon