in reply to Search Engine Theory

I think every programmer who does anything with HTML and building websites should read Philip and Alex's Guide to Web Publishing. In particular, see the chapter Publicizing Your Site for the ins and outs of how search engines work.

Now, for your particular question, I would build an "index of words and phrases". There are a few ways to go about this, but the following is the approach I would take (I've used this technique when building sites before, with quite good results). I should warn you that it is not suitable for indexing a large number of pages unless you have a lot of DB space to spare, but for a small or medium-sized site it works well (as it does if you have a good way of identifying the "interesting" text on a page and adding only that to the index).

To build the "index":

  1. Produce a "cleaned-up" version of the page: drop all common words ('and', 'of', 'the', etc.), strip all punctuation, and bring everything to the same case (either lower or upper; it doesn't matter as long as you're consistent).
  2. Compute all 1-, 2-, 3-, ... N-word sub-phrases of this cleaned-up text, and insert them all into your search DB, each referring to the page the text came from. I find that a value of 5 for N is good. Pick N based on the maximum number of words you expect someone to use in their "search phrase": larger values of N let people search for longer phrases, but make your search DB larger.

    To handle the re-ordering problem you can add a "sort the words within each subphrase" step; as long as you apply the same sorting to the search phrase at query time, it works fine. You can also include "skipped word" subphrases, but this tends to make the index large (trading DB size against the ability to search better). A rough Perl sketch of both steps follows this list.
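
As a sketch of steps 1 and 2 in Perl (the stop-word list, the $dbh handle, and the "search_index" table are just placeholders I picked for illustration, not anything from your setup):

    #!/usr/bin/perl -w
    use strict;

    # step 1: clean-up -- lowercase, strip punctuation, drop common words
    my %stop = map { $_ => 1 } qw(and of the a an to or in i is it);

    sub clean_text {
        my ($text) = @_;
        $text = lc $text;
        $text =~ s/[^a-z0-9\s]+/ /g;            # strip punctuation
        return grep { !$stop{$_} } split ' ', $text;
    }

    # step 2: all 1..N word subphrases of the cleaned-up word list;
    # sorting the words inside each subphrase handles re-ordering
    sub subphrases {
        my ($n_max, @words) = @_;
        my @phrases;
        for my $n (1 .. $n_max) {
            for my $i (0 .. @words - $n) {
                push @phrases, join ' ', sort @words[$i .. $i + $n - 1];
            }
        }
        return @phrases;
    }

    # indexing a page might then look something like this
    # (the "search_index" table and its columns are made up):
    #
    #   my $sth = $dbh->prepare(
    #       'INSERT INTO search_index (phrase, page_id) VALUES (?, ?)');
    #   $sth->execute($_, $page_id)
    #       for subphrases(5, clean_text($page_text));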

Example: consider the following text:
Or, you can loop through all the files each time, using perl and regexs to find your search terms, what I call "recurse-and-grep".
After step 1 (the clean-up) we have:
you can loop through all files each time using perl regexs find your search terms what call recurse grep
Then you compute all the subphrases of from 1..N words and store them in your DB. n=1 is basically what you describe doing in your original post. Fortunately this scales linearly in n, i.e. n=2 is approximately twice the size of n=1, n=3 is three times the size of n=1, etc. The "sort to handle re-ordering" trick doesn't increase this size at all; however, "skipped word" subphrases can make this growth quite fast.
Then when someone enters a search phrase, you use the same clean-up procedure above and then search your index for that text.

If you want to get fancy when handling the "no matches" case, you can compute the sub-phrases of the user's search phrase and search on those until you get at least one page that matches.
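
A companion sketch for the query side, including that fallback (it reuses the clean_text() routine and the made-up "search_index" table from the sketch above, and assumes the index was built with N=5):

    sub find_pages {
        my ($dbh, $query) = @_;
        my @words = clean_text($query);
        return unless @words;

        my $sth = $dbh->prepare(
            'SELECT DISTINCT page_id FROM search_index WHERE phrase = ?');

        # try the full (sorted) phrase first, then fall back to
        # progressively shorter subphrases of the query until at
        # least one page matches
        my $n_start = @words < 5 ? @words : 5;  # 5 = the N used when indexing
        for my $n (reverse 1 .. $n_start) {
            my %pages;
            for my $i (0 .. @words - $n) {
                $sth->execute(join ' ', sort @words[$i .. $i + $n - 1]);
                $pages{ $_->[0] } = 1 for @{ $sth->fetchall_arrayref };
            }
            return keys %pages if %pages;
        }
        return;                                 # nothing matched at all
    }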

RE: Re: Search Engine Theory
by Anonymous Monk on Jun 06, 2000 at 17:55 UTC
    Nice write-up... however, since I'd imagine that in normal use word order probably isn't much of an issue, a sort could be run on the words so that you have even fewer entries in the database; then it is just a matter of sorting the search words as well. Also, if you want to be able to return every permutation of the words, you only have to do one search. This also means you have fewer combinations of word subsets to search if your initial database request returns nothing.
      Exactly! Storing (and searching by) the "sorted substring of words" uses no extra space and handles all permutations of the "substring of words" automagically, avoiding the space waste of having to store all of the permutations.