in reply to Bring Back the Old Supersearch
Not that I was in love with the old SuperSearch, but the new one seems of very limited use. Could you describe what the basic problem is in terms of amount of data to be searched, number of records, and number of searches per minute?
I'm not convinced MySQL has such an incredible text search mechanism. Perhaps more of the work could be done in Perl, or with something else? I have had great results with htdig on spidered content files, maybe better results than most, since I didn't lose sleep over the latest security hole that was found in it recently: my mod_perl wrapper was suitably paranoid.
Might I suggest that the text to be searched be saved in another database designed solely for text searching? At the very least, searches would then put no load on MySQL at all. Such an engine would start by learning which words are in each page (rather than depending on regexes) and would use inverted indices. Synonyms, homonyms, misspellings, and fuzzy weighting of matches are all possible, and the redesigned engine would output only a certain number of results at a time.
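To make the inverted index idea concrete, here is a minimal sketch. It is in Python only to keep it short; the same structure maps directly onto a Perl hash of hashes. The names `tokenize`, `build_index`, and `search` are made up for illustration, the ranking is a plain word count rather than anything clever, and nothing here is taken from htdig or the site's code.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(nodes):
    """Build word -> {node_id: occurrence count} from {node_id: text}."""
    index = defaultdict(dict)
    for node_id, text in nodes.items():
        for word in tokenize(text):
            index[word][node_id] = index[word].get(node_id, 0) + 1
    return index


def search(index, query, offset=0, limit=10):
    """Return ids of nodes containing every query word, one page at a time."""
    words = tokenize(query)
    if not words:
        return []
    # Look up each word's posting list; no node text is scanned at query time.
    postings = [index.get(word, {}) for word in words]
    hits = set(postings[0])
    for posting in postings[1:]:
        hits &= set(posting)
    # Rank by total occurrences of the query words (assumption: crude scoring),
    # breaking ties by node id so the paging is stable.
    ranked = sorted(hits, key=lambda n: (-sum(p[n] for p in postings), n))
    return ranked[offset:offset + limit]
```

The point is that a query only touches the posting lists for its own words, and `offset`/`limit` hand back one page of results at a time. Stemming, synonyms, and misspelling tolerance would hook in at `tokenize`.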
A very straightforward hack using the htdig system might be to periodically output new nodes as files to disk, with some embedded fields for node id/title/author. (htdig already handles fields like this; for example, it can search mail header fields in mailing lists.) Or maybe the extra fields are looked up through a separate b-tree. The htdig database would then be updated with those files, and the files erased. Your mod_perl code slurps up the results and builds a search page the way you like it using the tag data.
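The export step could look roughly like this. Again it is a sketch in Python for brevity, not the site's code: the `node_<id>.html` file naming, the `node_id` and `author` meta tag names, and both function names are my own assumptions, and whether the indexer picks those tags up depends on how it is configured.

```python
import html
import os


def export_node(directory, node_id, title, author, body):
    """Write one node as a small HTML file with its fields embedded as tags."""
    path = os.path.join(directory, f"node_{node_id}.html")
    page = (
        "<html><head>\n"
        f"<title>{html.escape(title)}</title>\n"
        # Hypothetical field names; pick whatever the search front end expects.
        f'<meta name="node_id" content="{node_id}">\n'
        f'<meta name="author" content="{html.escape(author)}">\n'
        "</head><body>\n"
        f"{html.escape(body)}\n"
        "</body></html>\n"
    )
    with open(path, "w", encoding="utf-8") as fh:
        fh.write(page)
    return path


def export_new_nodes(directory, nodes):
    """Export a batch of nodes (dicts with node_id, title, author, body)."""
    os.makedirs(directory, exist_ok=True)
    return [export_node(directory, **node) for node in nodes]
```

A cron job would call `export_new_nodes` with whatever has been posted since the last run, let the indexer update from that directory, and then delete the files.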
Though I'm sure you've banged at this for a while, I just feel there are other solutions to the problem; TMTOWTDI. I'd be willing to do it. Anyway, you can try a boolean search on a gigabyte of data (60 sites) with word stemming here. A Perl-only solution may still be totally doable, though. My system (I call it EyeLatitude) is meant to allow various search engines to be plugged into the back of it, all bound up in Perl happiness. I'm selling it for significant bucks, but it's free to the monks if you want the code.