RE: Re: Search Engine Theory

Clever. I like it. Would it not be a good idea to record the name of the file only once? Like:

foo: index.html,5,18;
bar: index.html,6;page1.html,1;
baz: index.html,7;
[download]

That would make the index smaller. Then with some trivial splitting, you end up with something like:

$seq{'foo'}{'index.html'} = [5,18];
$seq{'bar'}{'index.html'} = [6];
$seq{'bar'}{'page1.html'} = [1];
$seq{'baz'}{'index.html'} = [7];
[download]

Now I just need a clever algorithm to iterate over it all and figure out which document has foo bar baz in order. Hmmm.

Comment on RE: Re: Search Engine Theory Select or Download Code

Replies are listed 'Best First'.
RE: RE: Re: Search Engine Theory by nardo (Friar) on Jun 06, 2000 at 20:16 UTC
The following should work, but it's off the top of my head and hasn't been thought out, so it's possible there's a better algorithm and it's also possible there's a bug in the code, so be sure to give it a good once over to make sure i haven't screwed up somewhere. $seq{'foo'}{'index.html'} = [5,18]; $seq{'bar'}{'index.html'} = [6]; $seq{'bar'}{'page1.html'} = [1]; $seq{'baz'}{'index.html'} = [7]; @pages = &find_phrase("foo bar baz"); sub find_phrase { my @words = split(/\s+/, $_[0]); my $word = shift @words; my $pos; my $page; my %found; foreach $page (keys %{$seq{$word}}) { foreach $pos (@{$seq{$word}{$page}}) { if(&find_phrase_at($page, $pos + 1, @words)) { $found{$page} = 1; } } } return keys %found; } sub find_phrase_at { my $file = shift; my $position = shift; my @words = @_; my $word; if(@words == 0) { return 1; } $word = shift @words; if(grep($_ == $position, @{$seq{$word}{$file}})) { return &find_phrase_at($file, $position + 1, @words); } return 0; } [download]	[reply] [d/l]

Replies are listed 'Best First'.

RE: RE: Re: Search Engine Theory
by nardo (Friar) on Jun 06, 2000 at 20:16 UTC

$seq{'foo'}{'index.html'} = [5,18];
$seq{'bar'}{'index.html'} = [6];
$seq{'bar'}{'page1.html'} = [1];
$seq{'baz'}{'index.html'} = [7];

@pages = &find_phrase("foo bar baz");

sub find_phrase
 {
   my @words = split(/\s+/, $_[0]);
   my $word = shift @words;
   my $pos;
   my $page;
   my %found;

   foreach $page (keys %{$seq{$word}})
    {
      foreach $pos (@{$seq{$word}{$page}})
       {
         if(&find_phrase_at($page, $pos + 1, @words))
          {
            $found{$page} = 1;
          }
       }
    }
   return keys %found;
 }

sub find_phrase_at
 {
   my $file = shift;
   my $position = shift;
   my @words = @_;
   my $word;

   if(@words == 0)
    {
      return 1;
    }
   $word = shift @words;
   if(grep($_ == $position, @{$seq{$word}{$file}}))
    {
      return &find_phrase_at($file, $position + 1, @words);
    }
   return 0;
 }
[download]

[reply]
[d/l]