Search Engine Theory

by httptech (Chaplain)
on Jun 06, 2000 at 03:22 UTC

httptech has asked for the wisdom of the Perl Monks concerning the following question:

As far as I can tell, there are two basic types of search engines for the usual "Search this Site" type of search. You can have an indexed search, where you might use Perl to dissect the HTML files on your site in an indexing process, cross-referencing a list of words with documents:

foo: index.html,page1.html
bar: page1.html
baz: page1.html,page2.html

Then you can search the index file for each keyword and include/exclude documents based on boolean constructs. It's very fast, but you are limited to boolean-style queries, like foo AND bar AND baz. There's no way to tell where foo, bar, and baz appear in the document.
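
For concreteness, here is a minimal sketch of such an inverted index and a boolean AND search over it (the filenames and the crude per-line word-splitting are just for illustration):

#!/usr/bin/perl -w
use strict;

# Build the inverted index: word => { document => 1 }
my %index;
foreach my $doc ('index.html', 'page1.html', 'page2.html') {
    open(DOC, $doc) or next;
    while (<DOC>) {
        s/<[^>]*>//g;                        # crude tag stripping
        $index{lc $_}{$doc} = 1 for /(\w+)/g;
    }
    close DOC;
}

# Boolean AND: keep only documents that matched every query word
sub and_search {
    my @words = @_;
    my %count;
    foreach my $word (@words) {
        $count{$_}++ for keys %{$index{lc $word} || {}};
    }
    return grep { $count{$_} == @words } keys %count;
}

print "$_\n" for and_search('foo', 'bar');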

Or, you can loop through all the files each time, using Perl and regexes to find your search terms; I call this "recurse-and-grep". It's slow and eats up CPU and HD time, but you can search for phrases, like foo bar baz.
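
A minimal sketch of that recurse-and-grep style using File::Find (the docroot path and the phrase are placeholders):

#!/usr/bin/perl -w
use strict;
use File::Find;

my $phrase = 'foo bar baz';

# Walk the document tree and scan every HTML file for the phrase
find(sub {
    return unless /\.html?$/;
    local $/;                          # slurp the whole file
    open(DOC, $_) or return;
    my $text = <DOC>;
    close DOC;
    $text =~ s/<[^>]*>//g;             # crude tag stripping
    print "$File::Find::name\n" if $text =~ /\Q$phrase\E/i;
}, '/path/to/docroot');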

My problem is: I want the speed of an indexed search, but I also want to be able to search for phrases, not just keywords. The big-name search engines can do this, but none of the Perl/CGI search scripts I have found to date can do both.

I considered doing something along the lines of using an indexed search to narrow my query down to just the documents that contain all the words of the phrase in any order, then grepping those documents looking for the phrase, but this will have widely varying speed based on how many documents are returned. In the case that every document matched the individual search terms it would actually be slower than just using the recurse-and-grep method alone.
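
That hybrid could be sketched as a two-pass search, reusing the hypothetical %index and and_search() from the first sketch above:

# Pass 1: the index narrows the search to documents containing
# every word of the phrase, in any order.
my $phrase     = 'foo bar baz';
my @candidates = and_search(split ' ', $phrase);

# Pass 2: grep only those candidate documents for the words in sequence.
my @hits;
foreach my $doc (@candidates) {
    local $/;
    open(DOC, $doc) or next;
    my $text = <DOC>;
    close DOC;
    push @hits, $doc if $text =~ /\Q$phrase\E/i;
}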

So, what's the secret?

Replies are listed 'Best First'.
Re: Search Engine Theory
by lhoward (Vicar) on Jun 06, 2000 at 05:12 UTC
    I think every programmer who does anything with HTML and building websites should read Philip and Alex's Guide to Web Publishing. Particularly see the chapter Publicizing Your Site for the ins and outs of how search engines work.

    Now, for your particular question, I would build an "index of words and phrases". There are a few ways to go about this, but the following is the approach I would take (I've used this technique when building sites before, with quite good results). I should let you know that it is not suitable for indexing lots of pages unless you have a lot of DB space to spare, but for a small or medium-sized site it works well (or for a larger site, if you have a good method of identifying the "interesting" text on a page and adding only that to the index).

    To build the "index"

    1. produce a "cleaned-up" version of the page. Drop all common words ('and', 'of', 'the', etc...), strip all punctuation and bring everything to the same case (either lower or upper, it doesn't matter as long as you're consistent).
    2. compute all 1, 2, 3....N word sub-phrases of this cleaned-up text, and insert them all in your searching DB refering to the page that the text came from. I find that a value of 5 for N is good. Pick N based on the maximal # of words you expect someone to search on in their "search phrase". Larger values of N will allow people to search for longer phrases, but will cause your search DB to be larger.

      To handle the re-ordering problem you can add a "sort words in subphrases" step, and as long as you do the same sorting on the search phrase it will work fine. You can also include "skipped word" subphrases, but this will tend to make the index large (trading off DB size against the ability to search better).

    example:
    Consider the following text:
    Or, you can loop through all the files each time, using perl and regexs to find your search terms, what I call "recurse-and-grep".
    after step 1 (clean-up) we have:
    you can loop through all files each time using perl regexs find your search terms what call recurse grep
    Then you generate all the subphrases of 1..N words for storage in your DB:
    • you
    • you can
    • can
    • you can loop
    • loop
    • can loop
    • etc...
    n=1 is basically what you describe doing in your original post. Fortunately, this scales linearly in N; i.e., n=2 is approximately twice the size of n=1, n=3 is three times the size of n=1, etc. The "sort to handle reordering" trick doesn't increase this size at all; however, doing "word skip subphrases" can make this growth quite fast.
    Then when someone enters a search phrase, you use the same clean-up procedure above and then search your index for that text.

    If you want to get fancy when handling the "no matches" case, you can compute the sub-phrases of the user's search phrase and search on those until you get at least one page with that phrase.
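
    A minimal sketch of both steps, assuming an abbreviated stopword list and a plain in-memory hash standing in for the real DB table:

    #!/usr/bin/perl -w
    use strict;

    my $N    = 5;                                     # longest subphrase indexed
    my %stop = map { $_ => 1 } qw(and of the a an);   # abbreviated stopword list
    my %index;                                        # subphrase => { page => 1 }

    sub index_page {
        my ($page, $text) = @_;

        # Step 1: clean-up - strip punctuation, lowercase, drop stopwords
        $text =~ s/[^\w\s]/ /g;
        my @words = grep { !$stop{$_} } split ' ', lc $text;

        # Step 2: index every subphrase of 1..N consecutive words
        for my $i (0 .. $#words) {
            for my $len (1 .. $N) {
                last if $i + $len > @words;
                my @sub = @words[$i .. $i + $len - 1];
                $index{join ' ', @sub}{$page} = 1;
            }
        }
    }

    # Searching reuses the same clean-up, then is a single lookup
    sub lookup {
        my $q = lc shift;
        $q =~ s/[^\w\s]/ /g;
        my $key = join ' ', grep { !$stop{$_} } split ' ', $q;
        return keys %{$index{$key} || {}};
    }

    A real site would store %index in a dbm file or an SQL table rather than rebuilding it for every query.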

      Nice write-up... However, since I'd imagine that in normal use word order probably isn't as much of an issue, a sort could be run on the words so that you have even fewer entries in the database; then it is just a matter of sorting the search words as well. Also, if you wanted to be able to return every combination of the words, you would only have to do one search. This also gives you the ability to search fewer combinations of subsets of the words if your initial database request returns nothing.
        Exactly! Sorting (and searching by) the "sorted substring of words" uses no extra space and handles all permutations of the "substring of words" automagically, avoiding all the space waste of having to store all of their permutations.
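
        In code, the only change to the subphrase sketch above would be canonicalizing word order with a sort, applied at both index time and query time; a fragment, reusing the hypothetical %index, %stop, and @sub names from that sketch:

        # When indexing: sort the words of each subphrase before storing
        $index{join ' ', sort @sub}{$page} = 1;

        # When searching: apply the same cleanup and sort to the query
        my @qwords = grep { !$stop{$_} } split ' ', lc $query;
        my $key    = join ' ', sort @qwords;
        my @pages  = keys %{$index{$key} || {}};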
Re: Search Engine Theory
by nardo (Friar) on Jun 06, 2000 at 13:20 UTC
    In order to see if a phrase is in a file, you need to record the position of each word within each file. For example:

    foo: index.html,5;index.html,18
    bar: index.html,6;page1.html,1
    baz: index.html,7

    foo is the fifth and eighteenth word in index.html, bar is the sixth word in index.html and the first in page1.html, and baz is the seventh word in index.html. As you can see, the phrase "foo bar baz" is in index.html. You could probably speed things up a bit by also storing the next word in the data (foo: index.html,5,bar;index.html,18,quux) so that you don't need to perform a lookup on bar for each occurrence of foo.
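
    Building that positional index is straightforward. A minimal sketch, with tokenization simplified and a hash-of-hashes standing in for the index file (the filenames are just for illustration):

    #!/usr/bin/perl -w
    use strict;

    my %seq;    # word => { file => [ positions ] }

    sub index_positions {
        my $file = shift;
        local $/;                          # slurp the whole file
        open(DOC, $file) or return;
        my $text = <DOC>;
        close DOC;
        $text =~ s/<[^>]*>//g;             # crude tag stripping

        # Record the 1-based position of every word in this file
        my $pos = 1;
        foreach my $word (map { lc } $text =~ /(\w+)/g) {
            push @{$seq{$word}{$file}}, $pos++;
        }
    }

    index_positions($_) for ('index.html', 'page1.html');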
      Clever. I like it. Would it not be a good idea to record the name of the file only once? Like:
      foo: index.html,5,18
      bar: index.html,6;page1.html,1
      baz: index.html,7
      That would make the index smaller. Then with some trivial splitting, you end up with something like:
      $seq{'foo'}{'index.html'} = [5,18];
      $seq{'bar'}{'index.html'} = [6];
      $seq{'bar'}{'page1.html'} = [1];
      $seq{'baz'}{'index.html'} = [7];
      Now I just need a clever algorithm to iterate over it all and figure out which document has foo bar baz in order. Hmmm.
        The following should work, but it's off the top of my head and hasn't been thought through, so it's possible there's a better algorithm, and it's also possible there's a bug in the code; be sure to give it a good once-over to make sure I haven't screwed up somewhere.

        $seq{'foo'}{'index.html'} = [5,18];
        $seq{'bar'}{'index.html'} = [6];
        $seq{'bar'}{'page1.html'} = [1];
        $seq{'baz'}{'index.html'} = [7];

        @pages = &find_phrase("foo bar baz");

        sub find_phrase {
            my @words = split(/\s+/, $_[0]);
            my $word  = shift @words;
            my %found;

            # Try each recorded position of the first word as the
            # possible start of the phrase.
            foreach my $page (keys %{$seq{$word}}) {
                foreach my $pos (@{$seq{$word}{$page}}) {
                    if (&find_phrase_at($page, $pos + 1, @words)) {
                        $found{$page} = 1;
                    }
                }
            }
            return keys %found;
        }

        sub find_phrase_at {
            my $file     = shift;
            my $position = shift;
            my @words    = @_;

            # No words left to match: the whole phrase was found.
            return 1 if @words == 0;

            # The next word must occur at exactly this position.
            my $word = shift @words;
            if (grep($_ == $position, @{$seq{$word}{$file}})) {
                return &find_phrase_at($file, $position + 1, @words);
            }
            return 0;
        }
RE: Search Engine Theory
by JanneVee (Friar) on Jun 06, 2000 at 18:13 UTC
    The secret is actually to use a fast DB, like MySQL. You build an index of the words in one table and relate the documents to the keywords; then a search is just a "dynamic" SQL SELECT. That's much faster than a grep cycle, because the database keeps its own index on your word index, whereas the grep approach scans through the whole index.
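
    A minimal sketch of what that might look like with DBI (the database, table, and column names are invented for illustration):

    #!/usr/bin/perl -w
    use strict;
    use DBI;

    my $dbh = DBI->connect('DBI:mysql:sitesearch', 'user', 'password',
                           { RaiseError => 1 });

    # One row per (word, document) pair; the index on `word` is what
    # makes lookups fast:
    #   CREATE TABLE keywords (
    #       word VARCHAR(64)  NOT NULL,
    #       doc  VARCHAR(255) NOT NULL,
    #       INDEX (word)
    #   );

    # Documents containing both 'foo' and 'bar'
    my $sth = $dbh->prepare(q{
        SELECT doc FROM keywords
        WHERE word IN ('foo', 'bar')
        GROUP BY doc
        HAVING COUNT(DISTINCT word) = 2
    });
    $sth->execute;
    while (my ($doc) = $sth->fetchrow_array) {
        print "$doc\n";
    }
    $sth->finish;
    $dbh->disconnect;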
RE: Search Engine Theory
by turnstep (Parson) on Jun 07, 2000 at 01:38 UTC

    For a really good discussion of searching theory, check out (IIRC) Chapter 5 of the Mastering Algorithms with Perl book. It's a *very* detailed and technical discussion, but it should give you a whole new perspective on how to write a good searching algorithm and/or program.
