Re: Perfect Indexer & Search Engine

Let's say you're considering the kind of search index that assigns each document a set of values, one per category. In the simple case, the categories you're using to categorise documents are words (or perhaps stems). Given a short document text like:

"Perl on Tuesday, Python on Wednesday, Rain on Thursday, Perl on Friday."

You might get a document index that looks something like

# All index values are normalised to the range 0..1
# "Stop words" (here, "on") have been removed
my $docindex = {
    'Perl' => 1,
    'Python' => 0.5,
    'Rain' => 0.5,
    'Tuesday' => 0.5,
    'Wednesday' => 0.5,
    'Thursday' => 0.5,
    'Friday' => 0.5,
};
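
For what it's worth, here is a minimal sketch of how an index like that could be built. The stop-word list and the build_docindex name are my own placeholders, and normalising by the highest word count is just one simple choice among many; a real indexer would probably also stem words and weight them across the whole collection.

```perl
use strict;
use warnings;

# Hypothetical stop-word list; a real one would be much longer.
my %stop = map { $_ => 1 } qw(on the a an and of);

# Build a normalised term index for one document: count each
# non-stop word, then divide by the highest count so that all
# values fall in the range 0..1.
sub build_docindex {
    my ($text) = @_;
    my %count;
    for my $word ($text =~ /(\w+)/g) {
        next if $stop{ lc $word };
        $count{$word}++;
    }
    my ($max) = sort { $b <=> $a } values %count;
    my %index = map { $_ => $count{$_} / $max } keys %count;
    return \%index;
}

my $docindex = build_docindex(
    "Perl on Tuesday, Python on Wednesday, Rain on Thursday, Perl on Friday."
);
```

Here 'Perl' occurs twice and every other remaining word once, so 'Perl' scores 1 and the rest score 0.5.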

With one of these indexes stored for every document, you'd have the kind of model that might be used in, say, a vector space search engine. Once you've got indexes of this sort, there's no reason why you can't add keys representing categories other than the words within a document, such as "belonging to Course 11". Add keys to the per-document index that could never be words, and use them internally to limit searches. For example:

$docindex = {
    'Perl' => 1,
    'Python' => 0.5,
    ...
    '~~Course11' => 1,
};
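
To make the idea concrete, here is a rough sketch of a vector space lookup that uses such a key as a filter. It assumes cosine similarity as the ranking measure (one common choice, not the only one), and the cosine and search subs and the document ids are invented for illustration.

```perl
use strict;
use warnings;

# Cosine similarity between two sparse vectors stored as hash refs.
sub cosine {
    my ($p, $q) = @_;
    my ($dot, $np, $nq) = (0, 0, 0);
    $dot += $p->{$_} * ($q->{$_} // 0) for keys %$p;
    $np  += $_ ** 2 for values %$p;
    $nq  += $_ ** 2 for values %$q;
    return ($np && $nq) ? $dot / sqrt($np * $nq) : 0;
}

# Rank documents against a query, keeping only those that carry
# the given internal category key (e.g. '~~Course11'), if one is given.
sub search {
    my ($docs, $query, $category) = @_;
    my %score;
    for my $id (keys %$docs) {
        my $index = $docs->{$id};
        next if defined $category && !$index->{$category};
        # Compare on word keys only, so category keys don't affect ranking.
        my %words = map { $_ => $index->{$_} } grep { !/^~~/ } keys %$index;
        $score{$id} = cosine($query, \%words);
    }
    my @ranked = sort { $score{$b} <=> $score{$a} || $a cmp $b }
                 grep { $score{$_} > 0 } keys %score;
    return @ranked;
}

# Hypothetical document ids and indexes, for illustration only.
my %docs = (
    'course11/week1/notes' => { Perl => 1, Python => 0.5, '~~Course11' => 1 },
    'course12/week1/notes' => { Perl => 1, '~~Course12' => 1 },
);
my @hits = search(\%docs, { Perl => 1 }, '~~Course11');
```

With the category argument, only the Course 11 document comes back; without it, both documents are ranked.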

But ... all that said, since you've got a meaningful file hierarchy already (/$course/$week/$item), I'd strongly recommend you look into something like htDig and using its restrict and exclude parameters to control where in the site a search is conducted. Basically, htDig can be set to create multiple separate search indexes (perhaps one per course?) or to create one big index and then limit per-search results by path. Its documentation should help.
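
As a sketch of the "one big index, limit by path" option: the search form (or a wrapper script) passes a restrict value along with the search words. The snippet below only builds such a query URL. The host, the /cgi-bin/htsearch location and the htsearch_url helper are assumptions of mine, so check the htsearch documentation for how your installation treats the restrict and exclude patterns.

```perl
use strict;
use warnings;

# Percent-encode a value for use in a query string.
sub uri_escape_simple {
    my ($s) = @_;
    $s =~ s/([^A-Za-z0-9\-_.~])/sprintf('%%%02X', ord $1)/ge;
    return $s;
}

# Build a search URL from the search words plus optional
# restrict/exclude URL patterns (hypothetical helper).
sub htsearch_url {
    my (%arg) = @_;
    my @pairs = ('words=' . uri_escape_simple($arg{words}));
    for my $name (qw(restrict exclude)) {
        next unless defined $arg{$name};
        push @pairs, "$name=" . uri_escape_simple($arg{$name});
    }
    return "$arg{base}?" . join('&', @pairs);
}

# Limit a search to pages under /course11/ (example host and path).
my $url = htsearch_url(
    base     => 'http://www.example.com/cgi-bin/htsearch',
    words    => 'perl regex',
    restrict => '/course11/',
);
```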

HTH
ViceRaid

update: rephrased for clarity

Re: Re: Perfect Indexer & Search Engine by YAFZ (Pilgrim) on Jun 17, 2003 at 13:39 UTC
Well, I was considering HTdig but wasn't sure about its database-integration capabilities. After your recommendations I'll concentrate on this software and see if it is up to my problem. Thanks for your comments.
Re: Re: Re: Perfect Indexer & Search Engine by ViceRaid (Chaplain) on Jun 17, 2003 at 14:47 UTC
Sorry, I didn't understand your question as clearly as Zaxo did. htDig's relational database integration capabilities are pretty much nil, AFAIK. Still, it might be easier to index the end product (the rendered pages) using an existing tool than to index the database itself and the XML/txt/HTML sources in a roll-your-own system. Then you wouldn't have to worry about reconstructing the URLs from the search results, and since you've already got a category->url mapping, you can build a user search interface that limits by category by restricting searches by URL path.

As an aside, it's also quite hard to do good free-text searches within an RDBMS: MySQL's FULLTEXT indexes are pretty limited. On the site I'm working on at the moment, we've ditched a search system built round Oracle's ConText / interMedia search tool in favour of an htDig system indexing the rendered pages within a CMS.

cheers
ViceRaid