in reply to Search Engine Theory
For your particular question I would build an "index of words and phrases". There are a few ways to go about this, but the following is the approach I would take (I've used this technique when building sites before, with quite good results). Be aware that this technique is not suitable for indexing lots of pages unless you have a lot of DB space to spare. For a small or medium-sized site with some DB space available it works well, and it can also work if you have a good method of identifying the "interesting" text on a page and add only that text to the index.
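To build the "index", clean down the text of each page (step 1), then store every subphrase of 1..N words from the cleaned text in your DB.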
To handle the re-ordering problem you can add a "sort the words in each subphrase" step; as long as you apply the same sorting to the search phrase at search time, it will work fine. You can also include "skipped word" subphrases, but this tends to make the index large (trading DB size against the ability to search better).
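Here is a rough sketch of those two variations, written in Python for brevity (it should port to Perl without trouble). The names `sorted_key` and `skip_subphrases`, and the `max_skip` parameter, are made up for illustration, and "skipped word" is read here as "an n-word phrase that may jump over up to `max_skip` words", which is only one possible interpretation.

```python
from itertools import combinations

def sorted_key(words):
    """Canonical index key for a subphrase: its words in sorted order."""
    return " ".join(sorted(words))

def skip_subphrases(words, n, max_skip=1):
    """n-word subphrases that may skip up to max_skip words (assumed meaning)."""
    out = set()
    window = n + max_skip
    for start in range(len(words)):
        chunk = words[start:start + window]
        # Keep the first word fixed and choose the other n-1 words, in order,
        # from the rest of the window.
        for rest in combinations(range(1, len(chunk)), n - 1):
            out.add(" ".join([chunk[0]] + [chunk[j] for j in rest]))
    return out

# "loop can you" and "you can loop" map to the same key.
same = sorted_key(["loop", "can", "you"]) == sorted_key(["you", "can", "loop"])

# Four words give 3 contiguous 2-word subphrases, but 5 once one skip is allowed.
pairs = skip_subphrases(["you", "can", "loop", "through"], n=2, max_skip=1)
```

Sorting only changes the key under which a subphrase is stored, so the number of entries stays the same. Skipping adds entries for every window, which is where the extra index size comes from.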
Consider the following text:

Or, you can loop through all the files each time, using perl and regexs to find your search terms, what I call "recurse-and-grep".

After step 1 (clean-down) we have:

you can loop through all files each time using perl regexs find your search terms what call recurse grep

Then you compute all the subphrases of 1..N words for storage in your DB:

- you
- you can
- can
- you can loop
- loop
- can loop
- etc...

n=1 is basically what you describe doing in your original post. Fortunately this scales linearly with N, i.e. the index for n=2 is approximately twice the size of n=1, n=3 is three times the size of n=1, etc. The "sort to handle re-ordering" trick doesn't increase this size at all; however, adding "word skip subphrases" can make the index grow quite fast.

Then when someone enters a search phrase, you use the same clean-up procedure above and then search your index for that text.
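A minimal sketch of the clean-down and subphrase steps, in Python for brevity (the same logic is a few lines of Perl). The stop-word list is an assumption inferred from the example above (it drops "or", "the", "and", "to" and "I"); a real list would be longer. The function names are also just for illustration.

```python
import re

# Assumed stop words, inferred from the worked example only.
STOP_WORDS = {"or", "the", "and", "to", "i"}

def clean_down(text):
    """Lowercase, split on anything that is not a letter or digit, drop stop words."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return [w for w in words if w not in STOP_WORDS]

def subphrases(words, max_n):
    """All contiguous subphrases of 1..max_n words."""
    return [" ".join(words[i:i + n])
            for n in range(1, max_n + 1)
            for i in range(len(words) - n + 1)]

TEXT = ('Or, you can loop through all the files each time, using perl and '
        'regexs to find your search terms, what I call "recurse-and-grep".')

words = clean_down(TEXT)
print(" ".join(words))
for max_n in (1, 2, 3):
    print(max_n, len(subphrases(words, max_n)))
```

For this one sentence the cleaned text has 19 words, and the index holds 19, 37 and 54 entries for N = 1, 2 and 3, which is the roughly linear growth in question.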
If you want to get fancy when handling the "no matches" case, you can compute the subphrases of the user's search phrase and search on those until you get at least one page containing one of them.
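A sketch of that fallback, again in Python and with an in-memory dict standing in for the DB. The page names, the stop-word list and the choice to try the longest subphrases first are all assumptions made for the example.

```python
import re

STOP_WORDS = {"or", "the", "and", "to", "i"}  # assumed stop list

def clean_down(text):
    words = re.findall(r"[a-z0-9]+", text.lower())
    return [w for w in words if w not in STOP_WORDS]

def subphrases(words, max_n):
    return [" ".join(words[i:i + n])
            for n in range(1, max_n + 1)
            for i in range(len(words) - n + 1)]

def build_index(pages, max_n=3):
    """Map each subphrase to the set of pages that contain it."""
    index = {}
    for page_id, text in pages.items():
        for phrase in subphrases(clean_down(text), max_n):
            index.setdefault(phrase, set()).add(page_id)
    return index

def search(index, query, max_n=3):
    """Try the longest subphrases of the query first, then shorter ones."""
    words = clean_down(query)
    for n in range(min(len(words), max_n), 0, -1):
        hits = set()
        for i in range(len(words) - n + 1):
            hits |= index.get(" ".join(words[i:i + n]), set())
        if hits:
            return hits  # stop at the first length that matches any page
    return set()

# Hypothetical pages, just for the demonstration.
PAGES = {
    "a.html": "You can loop through all the files",
    "b.html": "Search terms and regexs",
}
INDEX = build_index(PAGES)

exact = search(INDEX, "loop through")            # matches as a 2-word phrase
fallback = search(INDEX, "loop through regexs")  # no 3-word match, falls back
nothing = search(INDEX, "perl grep")             # no match at any length
```

Trying longer subphrases first means the pages returned are the ones closest to what the user typed; single words are used only as a last resort.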
RE: Re: Search Engine Theory
by Anonymous Monk on Jun 06, 2000 at 17:55 UTC
by lhoward (Vicar) on Jun 06, 2000 at 18:01 UTC