in reply to Perfect Indexer & Search Engine

Let's say you're considering the kind of search index that assigns each document a set of values, one per category. In the simple case, the categories you use to classify documents are words (or perhaps stems). Given a short document text like:

"Perl on Tuesday, Python on Wednesday, Rain on Thursday, Perl on Friday."

You might get a document index that looks something like

# normalised all the index values to be in the range 0..1
# removed "stop words"
$docindex = {
    'Perl'      => 1,
    'Python'    => 0.5,
    'Rain'      => 0.5,
    'Tuesday'   => 0.5,
    'Wednesday' => 0.5,
    'Thursday'  => 0.5,
    'Friday'    => 0.5
};
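As a rough sketch of how an index like that could be produced: count the words, skip the stop words, and divide by the highest count so the values fall in 0..1. The `build_docindex` name and the tiny stop list are made up for illustration, and there's no stemming here.

```perl
use strict;
use warnings;

# Hypothetical stop list; a real one would be much longer.
my %stop = map { $_ => 1 } qw(on the a an and of);

# Build a normalised term index for one document.
sub build_docindex {
    my ($text) = @_;
    my %count;
    for my $word ($text =~ /(\w+)/g) {
        next if $stop{ lc $word };
        $count{$word}++;
    }
    # Scale so that the most frequent term scores 1.
    my ($max) = sort { $b <=> $a } values %count;
    return {} unless $max;
    return { map { $_ => $count{$_} / $max } keys %count };
}

my $docindex = build_docindex(
    "Perl on Tuesday, Python on Wednesday, Rain on Thursday, Perl on Friday."
);
printf "%-10s %.1f\n", $_, $docindex->{$_} for sort keys %$docindex;
```

"Perl" appears twice and everything else once, which is where the 1 and 0.5 values come from.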

With one of these indexes stored for every document, you'd have the kind of model that might be used in, say, a vector space search engine. Once you've got indexes of this sort, there's no reason why you can't add keys representing categories other than the words within a document, such as "belonging to Course 11". Add keys to the per-document index that could never be words, but that can be used internally to limit searches. For example:

$docindex = {
    'Perl'       => 1,
    'Python'     => 0.5,
    ...
    '~~Course11' => 1
};
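To make the "limit searches" part concrete, here's a minimal sketch that ranks documents by cosine similarity but only looks at those carrying a required marker key. The document paths, the `%docs` store and the `search` and `cosine` names are all assumptions for the example, and a real vector space engine would do rather more (term weighting, an inverted index, and so on).

```perl
use strict;
use warnings;

# Two hypothetical per-document indexes, keyed by a made-up document path.
my %docs = (
    '/course11/week1/intro' => { 'Perl' => 1,   'Python' => 0.5, '~~Course11' => 1 },
    '/course12/week1/intro' => { 'Perl' => 0.5, 'Rain'   => 1,   '~~Course12' => 1 },
);

# Cosine similarity between two sparse vectors stored as hashrefs.
sub cosine {
    my ($p, $q) = @_;
    my ($dot, $np, $nq) = (0, 0, 0);
    $dot += $p->{$_} * ($q->{$_} // 0) for keys %$p;
    $np  += $_ ** 2 for values %$p;
    $nq  += $_ ** 2 for values %$q;
    return 0 unless $np && $nq;
    return $dot / sqrt($np * $nq);
}

# Rank documents against the query, considering only documents that
# carry the required internal key (e.g. '~~Course11'), if one is given.
sub search {
    my ($query, $required) = @_;
    my %score;
    for my $id (keys %docs) {
        my $index = $docs{$id};
        next if defined $required && !$index->{$required};
        # Compare on word keys only, so the marker doesn't affect ranking.
        my %words = map { $_ => $index->{$_} } grep { !/^~~/ } keys %$index;
        my $s = cosine($query, \%words);
        $score{$id} = $s if $s > 0;
    }
    my @ranked = sort { $score{$b} <=> $score{$a} } keys %score;
    return @ranked;
}

my @hits = search({ 'Perl' => 1 }, '~~Course11');
print "$_\n" for @hits;
```

Leaving out the second argument searches every document; passing '~~Course11' drops anything outside that course before scoring.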

But ... all that said, since you've already got a meaningful file hierarchy (/$course/$week/$item), I'd strongly recommend you look into something like HTdig and using its restrict and exclude parameters to control where in the site a search is conducted. Basically, HTdig can be set to create multiple separate search indexes (perhaps one per course?) or to create one big index and then limit per-search results by path. Its documentation should help.
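For what it's worth, a restricted search could be driven from your own form handler by building the htsearch query string yourself. This is only a sketch: the /cgi-bin/htsearch location, the `htsearch_url` helper and the course and week names are assumptions, and you should check the htsearch docs for exactly how restrict and exclude patterns are matched against URLs.

```perl
use strict;
use warnings;

# Percent-encode a value for use in a query string.
# Assumes plain ASCII (or already-encoded bytes) as input.
sub uri_escape_simple {
    my ($s) = @_;
    $s =~ s/([^A-Za-z0-9\-_.~])/sprintf('%%%02X', ord $1)/ge;
    return $s;
}

# Build an htsearch URL limited to one course (and optionally one week).
# The script location is an assumption; adjust it to the local install.
sub htsearch_url {
    my (%arg) = @_;
    my $restrict = "/$arg{course}/";
    $restrict .= "$arg{week}/" if defined $arg{week};
    my @pairs = (
        [ words    => $arg{words} ],
        [ restrict => $restrict ],
    );
    push @pairs, [ exclude => $arg{exclude} ] if defined $arg{exclude};
    my $query = join '&',
        map { "$_->[0]=" . uri_escape_simple($_->[1]) } @pairs;
    return "/cgi-bin/htsearch?$query";
}

print htsearch_url(words => 'perl regex', course => 'course11', week => 'week3'), "\n";
```

The same idea works with hidden restrict and exclude fields in an HTML search form, if you'd rather not build the URL in code.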

HTH
ViceRaid

update: rephrased for clarity

Re: Re: Perfect Indexer & Search Engine
by YAFZ (Pilgrim) on Jun 17, 2003 at 13:39 UTC
    Well, I was considering HTdig, but I wasn't sure about its database-integration capabilities. Following your recommendation, I'll concentrate on this software and see if it is up to my problem. Thanks for your comments.

      Sorry, I didn't understand your question as clearly as Zaxo. htDig's relational database integration capabilities are pretty much nil, AFAIK.

      Still, it might be easier to index the end-product - the rendered pages - using an existing product, rather than indexing the database itself and the XML/txt/HTML sources in a roll-your-own system. Then you wouldn't have to worry about reconstructing the URLs from the search results. And since you've already got a category->url mapping, you can build a user search interface that limits searches by category simply by restricting them by URL path.

      As an aside, it's also quite hard to do good free-text searches within an RDBMS - MySQL's FULLTEXT indexes are pretty limited. On the site I'm working on at the moment, we've ditched a search system built round Oracle's ConText / Intermedia search tool in favour of an htDig system indexing the rendered pages within a CMS.

      cheers
      ViceRaid