httptech has asked for the wisdom of the Perl Monks concerning the following question:
As far as I can tell, there are two basic types of search engines
for the usual "Search this Site" type of search. You can have
an indexed search, where you might use perl to dissect the HTML
files on your site in an indexing process, cross-referencing a list
of words with documents:
foo: index.html,page1.html
bar: page1.html
baz: page1.html,page2.html
Then you can search the index file for each keyword and
include/exclude documents based on boolean constructs.
It's very fast, but you are limited to boolean-style constructs, like foo AND bar AND baz.
There's no way to tell where foo, bar, and baz appear in the document,
so you can't match an exact phrase.
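The indexed approach above can be sketched as follows (the index data and sub name are illustrative, not from any particular script): an inverted index maps each word to the pages containing it, and an AND query intersects the page lists.

```perl
use strict;
use warnings;

# word => { page => 1 }: the inverted index from the example above.
my %index = (
    foo => { 'index.html' => 1, 'page1.html' => 1 },
    bar => { 'page1.html' => 1 },
    baz => { 'page1.html' => 1, 'page2.html' => 1 },
);

# Return the pages that contain every query word (boolean AND):
# count how many query words hit each page and keep the pages hit
# by all of them.
sub and_search {
    my @words = @_;
    my %count;
    for my $word (@words) {
        $count{$_}++ for keys %{ $index{$word} || {} };
    }
    return sort grep { $count{$_} == @words } keys %count;
}
```

Note that nothing here records *where* in a page each word occurred, which is exactly the limitation described above.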
Or, you can loop through all the files on each search, using Perl
and regexes to find your search terms, what I call "recurse-and-grep".
It's slow and eats up CPU and disk time, but you can search for phrases,
like foo bar baz.
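A minimal "recurse-and-grep" sketch, using File::Find to walk a tree and a regex to match the phrase (the sub name and the throwaway demonstration file are illustrative only):

```perl
use strict;
use warnings;
use File::Find;
use File::Temp qw(tempdir);

# Walk $dir and return every file whose contents match $phrase.
sub grep_tree {
    my ($dir, $phrase) = @_;
    my @hits;
    find(sub {
        return unless -f;              # skip directories
        open my $fh, '<', $_ or return;
        local $/;                      # slurp the whole file
        push @hits, $File::Find::name if <$fh> =~ /\Q$phrase\E/i;
    }, $dir);
    return sort @hits;
}

# Demonstration on a throwaway directory:
my $dir = tempdir(CLEANUP => 1);
open my $fh, '>', "$dir/page1.html" or die $!;
print $fh "searching for foo bar baz here";
close $fh;
```

Every search re-reads every file, which is where the CPU and disk cost comes from.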
My problem is: I want the speed of an indexed search, but
I also want to be able to search for phrases, not just
keywords. The big-name search engines can do this, but none of the
Perl/CGI search scripts I have found to date can do both.
I considered doing something along the lines of using an
indexed search to narrow my query down to just the documents
that contain all the words of the phrase in any order, then
grepping those documents looking for the phrase, but this will
have widely varying speed based on how many documents are returned.
In the case that every document matched the individual search terms
it would actually be slower than just using the recurse-and-grep method
alone.
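The hybrid idea above can be sketched like this (the index, the in-memory page contents, and the sub name are all illustrative stand-ins; in practice step 2 would open the candidate files):

```perl
use strict;
use warnings;

# word => [ pages containing it ]
my %index = (
    foo => [ 'index.html', 'page1.html' ],
    bar => [ 'page1.html' ],
    baz => [ 'page1.html', 'page2.html' ],
);
# page => its text (stand-in for reading the file from disk)
my %content = (
    'index.html' => 'foo and some other text',
    'page1.html' => 'a page with foo bar baz in order',
    'page2.html' => 'baz alone',
);

sub phrase_search {
    my ($phrase) = @_;
    my @words = split /\s+/, $phrase;
    # Step 1: use the word index to find candidate pages that
    # contain every word, in any order.
    my %count;
    for my $w (@words) { $count{$_}++ for @{ $index{$w} || [] } }
    my @candidates = grep { $count{$_} == @words } keys %count;
    # Step 2: grep only the candidates for the exact phrase.
    return sort grep { $content{$_} =~ /\Q$phrase\E/ } @candidates;
}
```

As noted above, the worst case (every page is a candidate) degenerates to recurse-and-grep plus the cost of the index lookup.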
So, what's the secret?
Re: Search Engine Theory
by lhoward (Vicar) on Jun 06, 2000 at 05:12 UTC
I think every programmer who does anything with HTML and
building websites should read
Philip and Alex's Guide to Web Publishing. In particular, see the chapter
Publicizing Your Site for the ins and outs of how search
engines work.
Now for your particular question, I would build an "index of
words and phrases". There are a few ways to go about this,
but the following is the approach I would take
(I've used this technique before when building sites, with quite
good results). I should let you know that this technique is not
suitable for indexing lots of pages unless you have a lot of DB
space to spare (or a good method of identifying the "interesting"
text on a page and adding only that to the index), but for a small
or medium-sized site with some DB space to spare it works well.
To build the "index":
- produce a "cleaned-up" version of the page. Drop all
common words ('and', 'of', 'the', etc.), strip all punctuation,
and fold everything to the same case (either lower or upper,
it doesn't matter as long as you're consistent).
- compute all 1-, 2-, 3-, ... N-word subphrases of this
cleaned-up text, and insert them all into your search DB, each
referring to the page the text came from. I find that a value of 5
for N is good. Pick N based on the maximum number of words you
expect someone to use in their "search phrase". Larger values of N
will allow people to search for longer phrases, but
will make your search DB larger.
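The clean-up step above might look like this (the stopword list is a tiny illustrative sample, not a recommendation):

```perl
use strict;
use warnings;

# A few common words to drop; a real list would be longer.
my %stop = map { $_ => 1 } qw(and of the or a to i what);

# Lowercase, strip punctuation, drop stopwords; returns the
# cleaned-up word list.
sub clean_text {
    my ($text) = @_;
    $text = lc $text;
    $text =~ s/[^a-z0-9\s]/ /g;    # punctuation becomes whitespace
    return grep { !$stop{$_} } split ' ', $text;
}
```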
To handle the word-order problem you can add a "sort the words in
each subphrase" step; as long as you apply the same sorting to the
search phrase at query time, it will work fine. You can also include
"skipped-word" subphrases, but those tend to make the index large
(you're trading DB size against the ability to search better).
example:
Consider the following text:
Or, you can loop through all the files each time,
using perl and regexs to find your search terms,
what I call "recurse-and-grep".
after step 1 (clean-up) we have
you can loop through all files each time using perl
regexs find your search terms what call recurse grep
Then you compute all the subphrases of 1..N words
and store them in your DB:
- you
- you can
- can
- you can loop
- loop
- can loop
- etc...
n=1 is basically what you describe doing in your original
post. Fortunately the index size scales linearly with N, i.e. n=2 is
approximately twice the size of n=1, n=3 is three times the size
of n=1, and so on. The "sort to handle reordering" trick doesn't
increase this size at all; doing "word-skip" subphrases, however,
can make this growth quite fast.
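The subphrase enumeration can be sketched as follows (sub name illustrative): every run of 1..N consecutive words in the cleaned-up text becomes an index key, which is why the index grows roughly N-fold.

```perl
use strict;
use warnings;

# Return every run of 1..$n consecutive words from @words.
sub subphrases {
    my ($n, @words) = @_;
    my @phrases;
    for my $start (0 .. $#words) {
        for my $len (1 .. $n) {
            last if $start + $len > @words;    # ran off the end
            push @phrases, join ' ', @words[$start .. $start + $len - 1];
        }
    }
    return @phrases;
}
```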
Then when someone enters a search phrase, you use the
same clean-up procedure above and then search your
index for that text.
If you want to get fancy when handling the
"no matches" case, you can compute the subphrases of
the user's search phrase and search on those until you get at least
one page with a matching phrase.
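The lookup side, including that fallback, might look like this (the phrase index contents and sub name are illustrative; the real index would live in a DB): clean the query the same way as the pages, probe the index, and on a miss retry with the query's own subphrases, longest first.

```perl
use strict;
use warnings;

# phrase => [ pages ]: a tiny illustrative phrase index.
my %phrase_index = (
    'loop through' => [ 'page1.html' ],
    'loop'         => [ 'page1.html', 'page2.html' ],
);

# Look up the cleaned-up query words; on a miss, fall back to
# shorter subphrases of the query, longest first.
sub lookup {
    my (@words) = @_;
    my $key = join ' ', @words;
    return @{ $phrase_index{$key} } if $phrase_index{$key};
    for my $len (reverse 1 .. @words - 1) {
        for my $start (0 .. @words - $len) {
            my $sub = join ' ', @words[$start .. $start + $len - 1];
            return @{ $phrase_index{$sub} } if $phrase_index{$sub};
        }
    }
    return ();    # nothing matched at any length
}
```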
Nice write-up! However, since in normal use word order probably isn't much of an issue, a sort could be run on the words so that you have even fewer entries in the database; then it is just a matter of sorting the search words as well.
Also, if you want to be able to return every combination of the words, you only have to do one search. This also gives you the ability to search fewer combinations of subsets of the words if your initial database request returns nothing.
Exactly! Sorting (and searching by) the
"sorted subphrase of words" uses no extra space
and handles all permutations of the
subphrase automagically,
avoiding all the space waste of having to store
every one of their permutations.
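The whole trick fits in one line: normalize both the indexed subphrases and the query through the same sort, so any ordering of the same words maps to a single key (sub name illustrative).

```perl
use strict;
use warnings;

# Canonical key for a subphrase: lowercase, split, sort, rejoin.
# Apply this both when building the index and when searching.
sub phrase_key {
    return join ' ', sort split /\s+/, lc $_[0];
}
```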
Re: Search Engine Theory
by nardo (Friar) on Jun 06, 2000 at 13:20 UTC
In order to see if a phrase is in a file, you need to record each word's position within each file. For example:
foo: index.html,5;index.html,18
bar: index.html,6;page1.html,1
baz: index.html,7
foo is the fifth and eighteenth word in index.html, bar is the sixth word in index.html and the first in page1.html, and baz is the seventh word in index.html. As you can see, the phrase "foo bar baz" is in index.html. You could probably speed things up a bit by also storing the next word in the data (foo: index.html,5,bar;index.html,18,quux) so that you don't need to perform a lookup on bar for each occurrence of foo.
Clever. I like it.
Would it not be a good idea to record the name of the file
only once? Like:
foo: index.html,5,18;
bar: index.html,6;page1.html,1;
baz: index.html,7;
That would make the index smaller. Then with some trivial
splitting, you end up with something like:
$seq{'foo'}{'index.html'} = [5,18];
$seq{'bar'}{'index.html'} = [6];
$seq{'bar'}{'page1.html'} = [1];
$seq{'baz'}{'index.html'} = [7];
Now I just need a clever algorithm to iterate over it all
and figure out which document has foo bar baz in
order. Hmmm.
The following should work, but it's off the top of my head and hasn't been thought out, so it's possible there's a better algorithm, and it's also possible there's a bug in the code; give it a good once-over to make sure I haven't screwed up somewhere.
use strict;
use warnings;

# word => { page => [ word positions within that page ] }
our %seq;
$seq{'foo'}{'index.html'} = [5,18];
$seq{'bar'}{'index.html'} = [6];
$seq{'bar'}{'page1.html'} = [1];
$seq{'baz'}{'index.html'} = [7];
my @pages = find_phrase("foo bar baz");

# Return the pages where the words of the phrase appear at
# consecutive positions.
sub find_phrase
{
    my @words = split /\s+/, $_[0];
    my $word  = shift @words;
    my %found;
    foreach my $page (keys %{ $seq{$word} || {} })
    {
        foreach my $pos (@{ $seq{$word}{$page} })
        {
            # The first word matches at $pos; check that the rest
            # follow it at consecutive positions.
            if (find_phrase_at($page, $pos + 1, @words))
            {
                $found{$page} = 1;
            }
        }
    }
    return keys %found;
}

# True if @words occur in $file at consecutive positions starting
# at $position.
sub find_phrase_at
{
    my ($file, $position, @words) = @_;
    return 1 if @words == 0;    # every remaining word matched
    my $word = shift @words;
    if (grep { $_ == $position } @{ $seq{$word}{$file} || [] })
    {
        return find_phrase_at($file, $position + 1, @words);
    }
    return 0;
}
RE: Search Engine Theory
by JanneVee (Friar) on Jun 06, 2000 at 18:13 UTC
The secret is actually to use a fast DB, like MySQL. Build an index of the words in one table and relate the documents to the keywords; then a search is just a "dynamic" SQL SELECT. That's much faster than a grep cycle, due to the index of the index: the grep approach scans the whole index file every time.
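A sketch of what that two-table layout might look like in MySQL (all table and column names are illustrative, not from the post):

```sql
-- One row per distinct word; the UNIQUE key is the "index of the index".
CREATE TABLE words (
    word_id INT AUTO_INCREMENT PRIMARY KEY,
    word    VARCHAR(64) NOT NULL,
    UNIQUE KEY (word)
);

-- Relates words to the documents that contain them.
CREATE TABLE word_documents (
    word_id  INT NOT NULL,
    document VARCHAR(255) NOT NULL,
    KEY (word_id)
);

-- A "dynamic" SELECT for a two-word AND query: keep only the
-- documents matched by both words.
SELECT wd.document
FROM   words w
JOIN   word_documents wd ON w.word_id = wd.word_id
WHERE  w.word IN ('foo', 'bar')
GROUP  BY wd.document
HAVING COUNT(DISTINCT w.word_id) = 2;
```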
RE: Search Engine Theory
by turnstep (Parson) on Jun 07, 2000 at 01:38 UTC
For a really good discussion of searching theory, check out
(IIRC) chapter 5 of the Mastering Algorithms
with Perl book. A *very* detailed and technical
discussion, but it should give you a whole new perspective on
how to write a good searching algorithm and/or program.