As far as I can tell, there are two basic types of search engine for the usual "Search this Site" feature. The first is an indexed search: an indexing pass (in Perl, say) dissects the HTML files on your site and builds a list that cross-references each word with the documents containing it:
foo: index.html,page1.html
bar: page1.html
baz: page1.html,page2.html
Then you can search the index file for each keyword and include or exclude documents based on boolean constructs. It's very fast, but you are limited to boolean-style queries, like foo AND bar AND baz. The index records only which documents contain foo, bar and baz, not where in each document they occur.
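To make the indexed approach concrete, here is a minimal sketch of the word-to-documents index and a boolean AND lookup. The `%docs` hash, `build_index` and `search_and` are hypothetical names invented for illustration, and the page text is assumed to have had its HTML stripped already; a real indexer would read the files from disk and save the index to a file.

```perl
use strict;
use warnings;

# Hypothetical sample data: file name => page text with the HTML already stripped.
my %docs = (
    'index.html' => 'foo is introduced here',
    'page1.html' => 'foo bar baz appear together',
    'page2.html' => 'only baz shows up',
);

# Build the index as word => { document => 1 }.
sub build_index {
    my ($docs) = @_;
    my %index;
    for my $file (keys %$docs) {
        for my $word (lc($docs->{$file}) =~ /(\w+)/g) {
            $index{$word}{$file} = 1;
        }
    }
    return \%index;
}

# Return the sorted list of documents that contain every given word (boolean AND).
sub search_and {
    my ($index, @words) = @_;
    return () unless @words;
    my %count;
    for my $word (map { lc } @words) {
        # Count, per document, how many of the query words it contains.
        $count{$_}++ for keys %{ $index->{$word} || {} };
    }
    my @hits = grep { $count{$_} == @words } keys %count;
    my @sorted = sort @hits;
    return @sorted;
}

my $index = build_index(\%docs);
print join(',', search_and($index, qw(foo bar baz))), "\n";
```

Note that the index stores no positions, which is exactly why it cannot answer a phrase query on its own.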

Or you can loop through all the files on every query, using Perl and regexes to find your search terms, which I call "recurse-and-grep". It's slow and eats CPU and disk time, but you can search for phrases, like foo bar baz.
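A rough sketch of recurse-and-grep, using the core File::Find module, might look like this. The `grep_phrase` name, the restriction to `.html`/`.htm` files and the crude tag stripping are all assumptions made for illustration, not a description of any existing script.

```perl
use strict;
use warnings;
use File::Find;

# Walk $root and return the sorted paths of .html/.htm files whose text
# contains $phrase (case-insensitive, any run of whitespace between words).
sub grep_phrase {
    my ($root, $phrase) = @_;
    my @words = split ' ', $phrase;
    return () unless @words;

    # Escape each word and allow any whitespace between consecutive words.
    my $pattern = join '\s+', map { quotemeta } @words;
    my $regex   = qr/(?<!\w)$pattern(?!\w)/i;

    my @hits;
    find(sub {
        return unless -f $_ && /\.html?\z/i;
        open my $fh, '<', $_ or return;        # skip unreadable files
        my $text = do { local $/; <$fh> };     # slurp the whole file
        close $fh;
        $text = '' unless defined $text;
        $text =~ s/<[^>]*>/ /g;                # crude tag stripping (assumption)
        push @hits, $File::Find::name if $text =~ $regex;
    }, $root);

    my @sorted = sort @hits;
    return @sorted;
}

# Hypothetical usage:
# print "$_\n" for grep_phrase('/path/to/site', 'foo bar baz');
```

Every query opens and reads every file under the root, which is where the CPU and disk cost comes from.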

My problem is this: I want the speed of an indexed search, but I also want to be able to search for phrases, not just keywords. The big-name search engines can do this, but none of the Perl/CGI search scripts I have found to date can do both.

I considered using an indexed search to narrow the query down to just the documents that contain all the words of the phrase, in any order, and then grepping those documents for the phrase itself. But the speed of this would vary widely with the number of documents the index returns. If every document matched the individual search terms, it would actually be slower than using the recurse-and-grep method alone.
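For what it's worth, the two-stage idea can be sketched as follows. As before, the data is held in a hypothetical `%docs` hash of already-stripped page text, and `build_index`, `candidates` and `phrase_search` are illustrative names only. The size of the list returned by `candidates` is what decides how much grepping is left to do.

```perl
use strict;
use warnings;

# Hypothetical sample data: file name => page text with the HTML already stripped.
my %docs = (
    'index.html' => 'baz bar foo in the wrong order',
    'page1.html' => 'here foo bar baz is a phrase',
    'page2.html' => 'only baz here',
);

# Build the index as word => { document => 1 }.
sub build_index {
    my ($docs) = @_;
    my %index;
    for my $file (keys %$docs) {
        $index{$_}{$file} = 1 for lc($docs->{$file}) =~ /(\w+)/g;
    }
    return \%index;
}

# Stage 1: documents that contain every word, in any order.
sub candidates {
    my ($index, @words) = @_;
    my %seen;
    my @unique = grep { !$seen{$_}++ } map { lc } @words;
    return () unless @unique;
    my %count;
    for my $word (@unique) {
        $count{$_}++ for keys %{ $index->{$word} || {} };
    }
    my @found = grep { $count{$_} == @unique } keys %count;
    my @sorted = sort @found;
    return @sorted;
}

# Stage 2: grep only the candidates for the words in sequence.
sub phrase_search {
    my ($docs, $index, $phrase) = @_;
    my @words = $phrase =~ /(\w+)/g;
    return () unless @words;
    my $pattern = join '\W+', map { quotemeta } @words;
    my $regex   = qr/(?<!\w)$pattern(?!\w)/i;
    return grep { $docs->{$_} =~ $regex } candidates($index, @words);
}

my $index = build_index(\%docs);
print join(',', phrase_search(\%docs, $index, 'foo bar baz')), "\n";
```

In this sample the index cuts three documents down to two candidates before the regex runs. In the worst case described above, the candidate list is the whole site and the index lookup is pure overhead.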

So, what's the secret?


In reply to Search Engine Theory by httptech
