in reply to Fulltext DB search: The Need for Speed

Well, this isn't really a Perl question, but more a question about how to organize your data. How Google does it is a secret, but I can tell you that Google isn't doing a full search against the entire web.

But I can speculate. My guess is that Google has a huge index: an index on words. When it fetches a (new) web page, it makes a list of all the words occurring in the page. For each word, it stores a pointer to that page in its word index (along with pointers to where in the document the word is found). So, if you search for "the brown cat with a glass eye", it will toss out the common words 'the' and 'a' (and perhaps 'with' as well). For each of 'brown', 'cat', 'glass', and 'eye', it gets the pointers to the pages that word is found in. To find the pages containing all the words, you take the intersection of the different (sub)results.
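To make the idea concrete, here is a minimal sketch of such a word index in Python. It is only an illustration of the guess above, not how Google (or any real engine) does it: the names `build_index`, `search`, and `STOP_WORDS`, the stop-word list, and the sample documents are all made up for the example.

```python
from collections import defaultdict

# Hypothetical stop-word list; what a real engine discards is unknown.
STOP_WORDS = {"the", "a", "with"}

def build_index(docs):
    """Map each word to {doc_id: [positions of the word in that document]}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for pos, word in enumerate(text.lower().split()):
            index[word].setdefault(doc_id, []).append(pos)
    return index

def search(index, query):
    """Return the ids of documents containing every non-stop word of the query."""
    words = [w for w in query.lower().split() if w not in STOP_WORDS]
    if not words:
        return set()
    # Start with the pages for the first word, then intersect with the rest.
    result = set(index.get(words[0], {}))
    for word in words[1:]:
        result &= set(index.get(word, {}))
    return result

# Made-up sample documents.
docs = {
    1: "the brown cat with a glass eye",
    2: "a brown dog with a glass of water",
    3: "the cat has a brown glass eye",
}
index = build_index(docs)
print(sorted(search(index, "the brown cat with a glass eye")))  # prints [1, 3]
```

Note that document 3 matches even though the words are not adjacent: the intersection only tells you that all the words occur somewhere in the page.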

Of course, in reality Google will be much smarter about it, perhaps indexing not just on single words but on word pairs or triples, or using a multilevel index.

But the important point is: if you want to search on words, and you want to search fast, you've got to index on words, and not use full text searches. And even if you want to return only documents that have the words of "the brown cat with a glass eye" right next to each other, it's a huge win if you limit your full text search to those documents that contain the words 'brown', 'cat', 'glass' and 'eye'.
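A rough sketch of that two-step approach, again in Python and again under assumptions: `word_index`, `phrase_search`, the stop-word list, and the sample documents are hypothetical, and the "full text search" is just a plain substring test (which ignores word boundaries).

```python
from collections import defaultdict

# Hypothetical stop-word list, as in the discussion above.
STOP_WORDS = frozenset({"the", "a", "with"})

def word_index(docs):
    """Map each word to the set of ids of the documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index

def phrase_search(docs, index, phrase):
    """Find documents containing the exact phrase, scanning only candidates."""
    words = [w for w in phrase.lower().split() if w not in STOP_WORDS]
    if words:
        # Step 1: use the index to narrow down to documents with all the words.
        candidates = set.intersection(*(index.get(w, set()) for w in words))
    else:
        # Nothing to look up in the index, so every document is a candidate.
        candidates = set(docs)
    # Step 2: run the expensive full text match on the candidates only.
    needle = phrase.lower()
    return sorted(d for d in candidates if needle in docs[d].lower())

# Made-up sample documents.
docs = {
    1: "I saw the brown cat with a glass eye yesterday",
    2: "the cat has a brown glass eye",
    3: "a brown dog",
}
index = word_index(docs)
print(phrase_search(docs, index, "the brown cat with a glass eye"))  # prints [1]
```

Here document 3 is never scanned at all, and document 2 is scanned but rejected. With millions of documents, step 1 is what keeps step 2 affordable.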

Abigail

Re: Re: Fulltext DB search: The Need for Speed
by jest (Pilgrim) on Oct 27, 2003 at 16:08 UTC

    Thanks. I should clarify that the MySQL built-in fulltext search I am currently using is an index-based search; I'm not merely doing full table scans to match "WHERE column LIKE '%$match%'" or something like that. And these other solutions I've tried, or am considering trying, are likewise building up an index of individual words.