sulfericacid has asked for the wisdom of the Perl Monks concerning the following question:

This isn't specifically about Perl, but since I'm sure some of you have worked with this, maybe you could help clear some things up.

I don't understand the logic behind how site searches work. When you type in a few words to search for, somehow the script will pull apart some pages and give you some results, right? How does it do that in such a timely manner?

My impression is that for the search to work, it has to open each file (or node) and rip apart its contents, THEN display the results. But if this were the case, how could it search all the nodes in a matter of seconds? Could it really open and read that many pages at one time?

It's really confusing to think of how Yahoo! pulls this off. They have millions of pages to search, but the results are still brought up in a matter of seconds. Can someone explain how searches are run?

Thank you, Wise Monks!



"Age is nothing more than an inaccurate number bestowed upon us at birth as just another means for others to judge and classify us"

sulfericacid

Replies are listed 'Best First'.
Re: Searches
by Zed_Lopez (Chaplain) on Dec 02, 2003 at 22:10 UTC

    Short answer: the pages are indexed in advance and these indices are stored in a database; the user's specified search terms are compared to the indices; the results are fairly speedy.
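
    To make the difference concrete, here is a minimal sketch (not how any particular site does it) that contrasts reading every page at query time with looking a term up in a prebuilt index. The %pages hash, its file names, and the two sub names are made up for illustration; a real site would keep the index in a database rather than in memory.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical stand-in for a site's pages: name => text.
my %pages = (
    'home.html'  => 'perl monks share perl wisdom',
    'faq.html'   => 'how to search this site',
    'about.html' => 'monks answer questions about perl',
);

# Slow way: read through every page for every query.
sub scan_search {
    my ($term) = @_;
    return sort grep { $pages{$_} =~ /\b\Q$term\E\b/ } keys %pages;
}

# Fast way: one pass, done ahead of time, builds word => { page => 1 } ...
my %index;
for my $page (keys %pages) {
    $index{$_}{$page} = 1 for split ' ', $pages{$page};
}

# ... so a query is a single hash lookup, however many pages there are.
sub index_search {
    my ($term) = @_;
    return sort keys %{ $index{$term} || {} };
}

print "scan:  @{[ scan_search('perl') ]}\n";
print "index: @{[ index_search('perl') ]}\n";
```

    Both calls report about.html and home.html; only the first one has to touch the page text at search time.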

    Long answer (or excuse for lack thereof): web searching is a hard and complicated problem. Here's a decent introductory article. Do some web searching on the subject. You'll find a lot of stuff. And if you had a good and novel approach to making it more effective or efficient, Google would probably be very happy to pay you handsomely.

      Heh! Good segue into perl :)

      Update: What? You guys don't think that was a good segue into perl from what is essentially a non-perl-related node?

Re: Searches
by Zaxo (Archbishop) on Dec 02, 2003 at 22:04 UTC

    Non-Perl answer: take a look at GNU id-utils. It munches through a tree and indexes words in a dbm file. (Very useful set of programs for some purposes).
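
    For a Perl flavour of the same idea (a word index kept in a dbm file), here is a rough sketch using SDBM_File, which ships with Perl. It does not reproduce id-utils or its file format. The sub names build_index and lookup, the sample documents, and the file name word_index are all assumptions for illustration. Note that SDBM limits each key/value pair to roughly 1 KB, so a real index would want DB_File, GDBM_File, or a proper database.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Fcntl;                      # O_RDWR, O_CREAT, O_RDONLY
use SDBM_File;                  # core module
use File::Temp qw(tempdir);

# Hypothetical location; SDBM creates word_index.pag and word_index.dir.
my $dir     = tempdir(CLEANUP => 1);
my $db_file = "$dir/word_index";

# Indexing pass: run once, ahead of time.
sub build_index {
    my ($file, %docs) = @_;
    tie my %index, 'SDBM_File', $file, O_RDWR | O_CREAT, 0666
        or die "couldn't tie $file: $!\n";
    for my $name (sort keys %docs) {
        my %seen = map { $_ => 1 } split ' ', $docs{$name};
        # DBM values are flat strings, so keep the file list space-separated
        # (this assumes file names contain no spaces).
        for my $word (keys %seen) {
            my $old = $index{$word};
            $index{$word} = defined $old ? "$old $name" : $name;
        }
    }
    untie %index;
}

# Query pass: opens the dbm file and reads a single entry.
sub lookup {
    my ($file, $word) = @_;
    tie my %index, 'SDBM_File', $file, O_RDONLY, 0666
        or die "couldn't tie $file: $!\n";
    my $files = $index{$word};
    untie %index;
    return defined $files ? split(' ', $files) : ();
}

build_index($db_file, 'a.txt' => 'perl search index', 'b.txt' => 'perl dbm');
print "perl found in @{[ lookup($db_file, 'perl') ]}\n";
```

    The point of the dbm file is that lookup never re-reads the documents; it only fetches one record from disk.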

    After Compline,
    Zaxo

Re: Searches
by inman (Curate) on Dec 03, 2003 at 11:08 UTC
    This site contains a reasonable amount of useful background information on searching, including many links to a variety of search tools: http://www.searchtools.com/index.html - Notice the link to Perl-based solutions.

    One of the more exotic search technologies is employed by Google and is known as PigeonRank. It employs a distributed multi-agent system to rank search results. The agents are widely available but do need to be trained!!

    inman

Re: Searches
by rinceWind (Monsignor) on Dec 03, 2003 at 11:44 UTC
    Just to add some more perl to this thread, check out the CPAN modules Search::InvertedIndex and DBIx::FullTextSearch. Modules such as these can make up the bones of a search mechanism for a wiki or a content management system, where the pages are stored in a relational database.

    --
    I'm Not Just Another Perl Hacker
Re: Searches
by sleepingsquirrel (Chaplain) on Dec 03, 2003 at 02:08 UTC
    #!/usr/bin/perl
    # web_search -- a program to simulate how sites index the 'Net.
    # invocation: <web_search *.pl> to index all the *.pl files
    # in the current directory.
    use strict;
    use warnings;

    my %words;

    ### Index the files
    for my $name (@ARGV) {
        open my $fh, '<', $name or die "couldn't open $name: $!\n";
        my $text = do { local $/; <$fh> };    # slurp the whole file
        close $fh;
        my %seen;                             # list each file once per word
        push @{ $words{$_} }, $name for grep { !$seen{$_}++ } split ' ', $text;
        #print "$_ -> @{$words{$_}}\n" for keys %words;  # pretty print
    }

    ### Ask for search keywords
    print "\nEnter your list of search words: ";
    while (my $line = <STDIN>) {
        for my $word (split ' ', $line) {
            if (defined $words{$word}) {
                print "$word found in " . join(", ", @{ $words{$word} }) . "\n";
            }
            else {
                print "$word not found\n";
            }
        }
    }
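
    The script above reports each search word on its own. A common next step is to return only the files that contain every word. The following is just a sketch of that: the sub name search_all and the sample %words data are invented here, with the index kept in the same word => [files] shape the script builds.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical index in the same shape as the script's %words.
my %words = (
    perl   => [ 'a.pl', 'b.pl', 'c.pl' ],
    search => [ 'b.pl', 'c.pl' ],
    index  => [ 'c.pl', 'd.pl' ],
);

# Return the files that contain every one of the given terms.
sub search_all {
    my ($index, @terms) = @_;
    return () unless @terms;
    my %count;
    for my $term (@terms) {
        my %seen;    # count each file once per term, even if listed twice
        $count{$_}++ for grep { !$seen{$_}++ } @{ $index->{$term} || [] };
    }
    # A file matches when it was counted once for each term.
    return sort grep { $count{$_} == @terms } keys %count;
}

print join(", ", search_all(\%words, qw(perl search))), "\n";    # b.pl, c.pl
```

    Only the file lists for the requested words are touched, so the cost depends on how common those words are, not on the total number of files indexed.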
Re: Searches
by Anonymous Monk on Dec 03, 2003 at 14:36 UTC