in reply to Search Engine Theory

In order to see if a phrase is in a file, you need to record the position in each file of a particular word. For example:

foo: index.html,5;index.html,18
bar: index.html,6;page1.html,1
baz: index.html,7

foo is the fifth and eighteenth word in index.html, bar is the sixth word in index.html and the first in page1.html, and baz is the seventh word in index.html. As you can see, the phrase "foo bar baz" is in index.html. You could probably speed things up a bit by also storing the next word in the data (foo: index.html,5,bar;index.html,18,quux) so that you don't need to perform a lookup on bar for each occurance of foo.

Replies are listed 'Best First'.
RE: Re: Search Engine Theory
by httptech (Chaplain) on Jun 06, 2000 at 15:56 UTC
    Clever. I like it. Would it not be a good idea to record the name of the file only once? Like:
    foo: index.html,5,18; bar: index.html,6;page1.html,1; baz: index.html,7;
    That would make the index smaller. Then with some trivial splitting, you end up with something like:
    $seq{'foo'}{'index.html'} = [5,18]; $seq{'bar'}{'index.html'} = [6]; $seq{'bar'}{'page1.html'} = [1]; $seq{'baz'}{'index.html'} = [7];
    Now I just need a clever algorithm to iterate over it all and figure out which document has foo bar baz in order. Hmmm.
      The following should work, but it's off the top of my head and hasn't been thought out, so it's possible there's a better algorithm and it's also possible there's a bug in the code, so be sure to give it a good once over to make sure i haven't screwed up somewhere.

      $seq{'foo'}{'index.html'} = [5,18]; $seq{'bar'}{'index.html'} = [6]; $seq{'bar'}{'page1.html'} = [1]; $seq{'baz'}{'index.html'} = [7]; @pages = &find_phrase("foo bar baz"); sub find_phrase { my @words = split(/\s+/, $_[0]); my $word = shift @words; my $pos; my $page; my %found; foreach $page (keys %{$seq{$word}}) { foreach $pos (@{$seq{$word}{$page}}) { if(&find_phrase_at($page, $pos + 1, @words)) { $found{$page} = 1; } } } return keys %found; } sub find_phrase_at { my $file = shift; my $position = shift; my @words = @_; my $word; if(@words == 0) { return 1; } $word = shift @words; if(grep($_ == $position, @{$seq{$word}{$file}})) { return &find_phrase_at($file, $position + 1, @words); } return 0; }