Perfect Indexer & Search Engine

YAFZ has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
Re: Perfect Indexer & Search Engine by ViceRaid (Chaplain) on Jun 17, 2003 at 11:53 UTC
Let's say you're considering types of search index which assign a set of values for different categories to each document. In the simple case, the categories you're using to categorise documents are words (or perhaps stems). Given a short document text like "Perl on Tuesday, Python on Wednesday, Rain on Thursday, Perl on Friday.", you might get a document index that looks something like this, with all the index values normalised to the range 0..1 and "stop words" removed: `$docindex = { 'Perl' => 1, 'Python' => 0.5, 'Rain' => 0.5, 'Tuesday' => 0.5, 'Wednesday' => 0.5, 'Thursday' => 0.5, 'Friday' => 0.5 };`

With one of these indexes for every document, stored somewhere, you'd have the kind of model that might be used in, say, a vector space search engine. Once you've got indexes of this sort, there's no reason why you can't add keys representing categories other than the words within a document, such as "belonging to Course 11". Add keys to the per-document index that would never be words, but can be used internally to limit searches. For example: `$docindex = { 'Perl' => 1, 'Python' => 0.5, ... '~~Course11' => 1 };`

But ... all that said, since you've got a meaningful file hierarchy already (`/$course/$week/$item`), I'd strongly recommend you look into something like HTdig and using its `restrict` and `exclude` parameters to control where in the site a search is conducted. Basically, HTdig can be set to create multiple separate search indexes (perhaps one per course?) or to create one big index and then limit per-search results by path. Its documentation should help.

HTH
ViceRaid

update: rephrased for clarity
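A minimal sketch of the per-document index described in the post above, for readers who want to see it run. The stop-word list, the document names and the `build_index`, `cosine` and `search` helpers are all hypothetical and not taken from any existing module; weights are normalised by dividing each word count by the highest count in the document, which reproduces the numbers in the example. Terms are case-sensitive here, as in the post's example index.

```perl
use strict;
use warnings;

# Hypothetical stop-word list; a real one would be much longer.
my %STOP = map { $_ => 1 } qw(on the a an and of in to);

# Build a per-document index: term => weight in 0..1,
# where the most frequent term gets weight 1.
sub build_index {
    my ($text, @categories) = @_;
    my %count;
    for my $word ($text =~ /(\w+)/g) {
        next if $STOP{ lc $word };
        $count{$word}++;
    }
    my ($max) = sort { $b <=> $a } values %count;
    my %index = map { $_ => $count{$_} / $max } keys %count;
    # Category keys get a "~~" prefix, so they can never collide with a word.
    $index{"~~$_"} = 1 for @categories;
    return \%index;
}

# Cosine similarity between two sparse vectors stored as hashes.
# Category keys are ignored so they do not influence the ranking.
sub cosine {
    my ($p, $q) = @_;
    my ($dot, $np, $nq) = (0, 0, 0);
    for my $k (grep { !/^~~/ } keys %$p) {
        $np  += $p->{$k} ** 2;
        $dot += $p->{$k} * $q->{$k} if exists $q->{$k};
    }
    $nq += $q->{$_} ** 2 for grep { !/^~~/ } keys %$q;
    return 0 unless $np && $nq;
    return $dot / sqrt($np * $nq);
}

# Rank documents against a query, optionally limited to one category.
sub search {
    my ($docs, $query, $category) = @_;
    my $q = build_index($query);
    my %score;
    for my $name (keys %$docs) {
        my $d = $docs->{$name};
        next if defined $category && !exists $d->{"~~$category"};
        my $s = cosine($q, $d);
        $score{$name} = $s if $s > 0;
    }
    return sort { $score{$b} <=> $score{$a} || $a cmp $b } keys %score;
}

# Usage: two made-up documents, search limited to Course11.
my %docs = (
    'doc-a' => build_index(
        'Perl on Tuesday, Python on Wednesday, Rain on Thursday, Perl on Friday.',
        'Course11'),
    'doc-b' => build_index('Python on Monday.', 'Course10'),
);
print "$_\n" for search(\%docs, 'Python', 'Course11');
```

Leaving out the third argument to `search` ranks across all documents; passing a category only skips documents that lack the matching `~~` key, so the same index serves both kinds of query.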
Re: Re: Perfect Indexer & Search Engine by YAFZ (Pilgrim) on Jun 17, 2003 at 13:39 UTC
Well, I was considering HTdig, but I'm not sure about its database-integration capabilities. After your recommendations I'll concentrate on this software and see if it is up to my problem. Thanks for your comments.
Re: Re: Re: Perfect Indexer & Search Engine by ViceRaid (Chaplain) on Jun 17, 2003 at 14:47 UTC
Sorry, I didn't understand your question as clearly as Zaxo. htDig's relational database integration capabilities are pretty much nil, AFAIK. Still, it might be easier to index the end-product - the rendered pages - using an existing product, rather than the database itself and XML/txt/HTML sources in a roll-your-own system. Then you wouldn't have to worry about reconstructing the URLs from the search results, and since you've already got a category->url mapping, you can build a user search interface that limits searches by category by restricting them to a URL path.

As an aside, it's also quite hard to do good free-text searches within an RDBMS - MySQL's FULLTEXT indexes are pretty limited. On the site I'm working on at the moment, we've ditched a search system built around Oracle's ConText / Intermedia search tool in favour of an htDig system indexing the rendered pages within a CMS.

cheers
ViceRaid
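A rough sketch of the "restrict by URL path" idea from the post above: the search form maps a chosen course to a URL prefix and passes it along with the query. The site root, the script location and the `words` field name are assumptions here (only `restrict` and `exclude` are named in the thread), so check them against the ht://Dig documentation before relying on this. `escape` and `search_url` are hypothetical helpers.

```perl
use strict;
use warnings;

# Assumptions: both the site root and the search script location are made up.
my $SITE     = 'http://www.example.com';
my $HTSEARCH = '/cgi-bin/htsearch';

# Percent-encode everything except unreserved characters (ASCII input assumed).
sub escape {
    my ($s) = @_;
    $s =~ s/([^A-Za-z0-9\-_.~])/sprintf('%%%02X', ord $1)/ge;
    return $s;
}

# Build a search URL; an optional course number limits hits
# to pages under that course's part of the /$course/$week/$item hierarchy.
sub search_url {
    my ($words, $course) = @_;
    my @pairs = ('words=' . escape($words));
    push @pairs, 'restrict=' . escape("$SITE/$course/") if defined $course;
    return "$HTSEARCH?" . join('&', @pairs);
}

print search_url('perl python', 11), "\n";
```

The full site root is included in the restriction rather than just `/11/`, because a bare path fragment could also match a week number further down the hierarchy.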
Re: Perfect Indexer & Search Engine by Zaxo (Archbishop) on Jun 17, 2003 at 12:43 UTC
Perfection is elusive :-)

If I understand your question, you want to know how to construct paths from db entries chosen from a cgi query. That is just a matter of building up a string. Select the db rows that match your query, according to your rules, and build the paths from the results. From your example data, it looks like the path is built from a db record as `"$course/$week/$id.$ext"`.

Is the Type associated with the particular url? If the url is significant to the search, consider making a directory for each, and putting a `DirectoryIndex /cgi-bin/searchscript.pl` line in each directory's .htaccess file. The searchscript.pl file can grab the url it was called under.

I'd like to see your whole design; what you show here seems slightly clunky.

After Compline,
Zaxo
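As a sketch of the string-building step described above, using the three example rows from the question. The rows are hard-coded here instead of being fetched from the database, and the Type-to-extension table is purely an assumption, since the thread never says where `$ext` comes from; `path_for` is a hypothetical helper.

```perl
use strict;
use warnings;

# Hypothetical mapping from Type to file extension; the thread does not give one.
my %EXT_FOR_TYPE = (1 => 'html', 5 => 'xml');

# Rows as they might come back from the database query.
my @rows = (
    { id => 1345, type => 1, course => 10, week => 1 },
    { id => 6544, type => 5, course => 11, week => 7 },
    { id => 1346, type => 5, course => 10, week => 1 },
);

# Build "$course/$week/$id.$ext" from one row.
sub path_for {
    my ($row) = @_;
    my $ext = $EXT_FOR_TYPE{ $row->{type} }
        or die "no extension known for type $row->{type}\n";
    return "$row->{course}/$row->{week}/$row->{id}.$ext";
}

# Print the paths of every matching row, here everything in course 10.
print path_for($_), "\n" for grep { $_->{course} == 10 } @rows;
```

In a real script the `grep` would be replaced by the WHERE clause of the query, so only matching rows ever reach `path_for`.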
Re: Re: Perfect Indexer & Search Engine by YAFZ (Pilgrim) on Jun 17, 2003 at 14:00 UTC
A solution without headaches (after it's implemented, of course) is perfect (until it causes new headaches, of course; well, that's what 'scalability' is for, isn't it ;-).

Yes, the Type is associated with some specific URL. The system knows which template (read script, special actions, etc.) to use on the content according to this Type information. I'm sorry for the clunky description of my design :) I tried to be as clear as possible, but that was the best I could compose at the time of writing. As you correctly understood, my problem can be described as 'knowing how to construct paths from db entries chosen from a search engine query'. And after considering the words of the monks (including yours), I think I'll evaluate HTdig and see what I can do.
Re: Perfect Indexer & Search Engine by Maclir (Curate) on Jun 17, 2003 at 12:44 UTC
You may want to look into Swish-E. It has some very powerful search-result rewriting capabilities, plus a ready-made Perl front end.
Re: Re: Perfect Indexer & Search Engine by YAFZ (Pilgrim) on Jun 17, 2003 at 14:53 UTC
I've just read an article about Swish-E. It looks like a nice little indexer and searcher (being able to index and search man pages is an especially great feature), but I'm not sure if it can handle thousands of files which sum up to hundreds of MB of data. I'm also unsure about performance when files are deleted or modified and have to be reindexed.
Re: Perfect Indexer & Search Engine by belg4mit (Prior) on Jun 17, 2003 at 16:06 UTC
"This means that the Indexer & Search system must take that into account, it is not a simple 'WORD -> THIS_FILE' structure but something that needs more transformations according to rules that I'll provide to system."

What's the point then, eh? Do you enjoy duplicating code? Let the search engine do the work: just use a search engine that can do HTTP indexing instead of indexing the local filesystem, maybe http://www.perlfect.com/freescripts/search/

`-- I'm not belgian but I play one on TV.`
Re: Re: Perfect Indexer & Search Engine by YAFZ (Pilgrim) on Jun 18, 2003 at 09:58 UTC
Thanks for pointing me to this compact search engine written in Perl. I'll take a look at it. It would be great if I could find a detailed comparison of HTdig, SWISH-E, Perlfect Search, etc.

ID	Type	Course	Week
1345	1	10	1
6544	5	11	7
1346	5	10	1