Re: Script generated site index db

I did this recently for a site that uses the Template Toolkit. I used HTML::TokeParser to parse the HTML and grab the text to index, and a combination of MLDBM and Storable to build the search index. You can also use HTML::TokeParser or HTML::LinkExtor to extract the links from your pages. Just don't parse them out by hand (please).
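To give a rough idea of the parsing step, here is a minimal sketch rather than the code I actually ran. The sample page, the variable names and the crude split on non-word characters are all made up for illustration; only HTML::TokeParser's new() and get_token() are the module's own.

```perl
use strict;
use warnings;
use HTML::TokeParser;   # from CPAN (HTML-Parser distribution), not core

# Hypothetical sample page; in practice this would be your rendered template output.
my $html = <<'HTML';
<html><head><title>Train times</title></head>
<body><h1>Trains</h1><p>See the <a href="/timetable.html">timetable</a>.</p></body></html>
HTML

# A reference to a scalar is parsed as a literal document.
my $p = HTML::TokeParser->new(\$html) or die "Cannot parse HTML";

my (@words, @links);
while (my $token = $p->get_token) {
    my $type = $token->[0];
    if ($type eq 'S' && $token->[1] eq 'a') {
        # Start tag token: [ 'S', $tag, \%attr, \@attrseq, $text ]
        my $href = $token->[2]{href};
        push @links, $href if defined $href;
    }
    elsif ($type eq 'T') {
        # Text token: [ 'T', $text, $is_data ]
        # Assumption: lowercase and split on non-word characters.
        push @words, grep { length } split /\W+/, lc $token->[1];
    }
}

print "words: @words\n";
print "links: @links\n";
```

A real indexer would probably also skip script and style content and decode entities, which this sketch does not do.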

I used stemming to reduce the size of the content and saved the relevant info about a page (the meta info) into a separate DBM cache. I did this so I could display the results with some of the page meta info (though on reflection this may not have been necessary).

If you don't already know, stemming reduces words to a shorter form based on a set of rules. Thus trains, trainers and training could all be considered to be 'train'. I used Paice-Husk stemming, although you could consider using Lingua::Stem, which uses Porter's algorithm.
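As a toy illustration of the idea only: the sketch below strips a handful of suffixes. It is not Paice-Husk and not Porter's algorithm; the suffix list, the three-letter minimum and the name toy_stem are assumptions picked so that the example words above come out as 'train'.

```perl
use strict;
use warnings;

# Toy suffix stripper for illustration only. NOT Paice-Husk or Porter;
# the suffix list and minimum stem length are assumptions.
my @suffixes = qw(ing ers er s);

sub toy_stem {
    my ($word) = @_;
    $word = lc $word;
    for my $suffix (@suffixes) {
        # Strip the first suffix that matches, keeping at least three letters.
        return $1 if $word =~ /^(\w{3,})\Q$suffix\E$/;
    }
    return $word;   # nothing to strip
}

# Prints: train, train, train, train
print join(', ', map { toy_stem($_) } qw(trains trainers training train)), "\n";
```

A real stemmer has far more rules (and exceptions) than this, which is why I would reach for an existing implementation.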

The search index used weighted term-based indexing.
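One way such an index could look, as a sketch under my own assumptions rather than a copy of what I built: an inverted index mapping each term to the pages it occurs on, with a weight per page. The sample pages, the 5:1 title/body weighting and the search() helper are all hypothetical, and for brevity this uses Storable alone (no MLDBM).

```perl
use strict;
use warnings;
use Storable qw(store retrieve);   # core module
use File::Temp qw(tempfile);       # core module

# Hypothetical pages, already tokenised (and ideally stemmed).
my %pages = (
    '/trains.html' => { title => [qw(train times)], body => [qw(train depart hourly train)] },
    '/buses.html'  => { title => [qw(bus times)],   body => [qw(bus replace train sunday)] },
);

# Assumed weighting: a title hit counts five times as much as a body hit.
my %weight = (title => 5, body => 1);

# Inverted index: term => { page => accumulated weight }
my %index;
for my $page (keys %pages) {
    for my $field (keys %weight) {
        $index{$_}{$page} += $weight{$field} for @{ $pages{$page}{$field} };
    }
}

# Persist the index to disk and load it back, as a search script would.
my ($fh, $file) = tempfile(UNLINK => 1);
close $fh;
store(\%index, $file);
my $loaded = retrieve($file);

# Return matching pages, highest total weight first.
sub search {
    my ($idx, @terms) = @_;
    my %score;
    for my $term (@terms) {
        my $postings = $idx->{$term} or next;
        $score{$_} += $postings->{$_} for keys %$postings;
    }
    return sort { $score{$b} <=> $score{$a} || $a cmp $b } keys %score;
}

print join("\n", search($loaded, 'train')), "\n";
```

The point of building this ahead of time is that a query becomes a couple of hash lookups instead of a scan over the whole content file.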

If you would rather just get a pre-built one, then go to Perlfect, as that's pretty good. I didn't use it because it didn't use stemming (that I could see) and it didn't do phrase searching, which mine does.

I mention all of this because your search index is not at all optimised and some pre-processing would speed things up.

I would recommend this even though you want to fetch the pages via a robot.

HTH

Simon

Re: Re: Script generated site index db by S_Shrum (Pilgrim) on Mar 19, 2002 at 10:38 UTC
Stemming, eh? Hmmm...interesting concept. I'll research it.

Overall, the total size of my current content file (even if I add in all the HTML template pages) is just over a meg. This currently generates over 100 pages of content at my site. If the content size were, say, 10 times greater, I would agree with you, as searching a 10 meg file would take time. I think for the first version I will probably not stem the content...or at least not until I have a better grasp on the specifics of it.

Also, by leaving things this way, anybody could search the data with whatever db engine they wanted to use, including phrase searching, say through DBI, DBD::AnyData, and SQL::Statement. ;D

Thanks for that bit of info about stemming...I will look into it.

======================
Sean Shrum
http://www.shrum.net
Re: Re: Re: Script generated site index db by rob_au (Abbot) on Jun 25, 2002 at 10:46 UTC
I have posted a node enquiring into stemming algorithms in Perl here, in particular the Porter algorithm employed in Lingua::Stem, which may be of interest to you.