Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

I wrote this all-Perl search engine a few months ago over a weekend. It's simplistic, but not simple. Basically it uses word-depth directories to store index files, plus an index header for relatively quick access to the indexed pages. With a one-word search I benchmarked it at about 7520 queries per second. That's fast, especially since I don't use a database like MySQL. The only problem is that the speed seems to fall by half, or at least by a third, with each added search keyword. A 6-word search gives about 3420 queries per second. I was wondering if any of my fellow Monks had worked on search engines in the past and would be willing to give me a few tips on handling multi-word searches, and searches in general? The source code is at http://windstone.darktech.org/~psypete/search_2.pl and, although it may be messy, it works.
Also, I can see that if my index files were bigger, I might see a speed impact in searching as well, since the index files would be huge. Of course I could have one index per word, but really, the size of the site indexed would determine the size of the index.

Replies are listed 'Best First'.
Re: All-Perl search engine having speed issues
by perrin (Chancellor) on Nov 19, 2001 at 22:14 UTC
    Take a look at Search::InvertedIndex. You could also try some of the C engines with Perl interfaces, like Swish, htdig, or glimpse.
      Could you explain briefly what an inverted index is, and maybe how I would implement one? I am very interested.
        perrin kindly provided a link, and the best thing to do would be to get the module and read the docs. You're trying to index a flat-file database, and it's just not scaling well. Search::InvertedIndex uses either MySQL or DB_File on the backend, and if you don't want to use MySQL, you should get DB_File. An 'inverted index' search just happens to be exactly what you're doing: your main database is indexed by some id value, and it's easy to look things up by that index, but you don't want to look things up by the id value; you want to look up id values by keywords. That's what Search::InvertedIndex does, and since the work has been done for you, you're advised to take advantage of it (and you can always look at the code if you're curious how it's implemented).
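        To make the idea concrete, here is a minimal sketch of an inverted index in plain Perl, with an AND-style multi-word search implemented as a set intersection. The documents and ids here are made up for illustration; Search::InvertedIndex handles the real persistence (DB_File or MySQL) for you.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical documents keyed by id -- in the real engine these would
# be the indexed pages.
my %docs = (
    1 => 'perl search engine',
    2 => 'fast perl code',
    3 => 'search the index',
);

# Build the inverted index: word => { doc_id => 1 }.
my %inverted;
for my $id (keys %docs) {
    for my $word (split /\W+/, lc $docs{$id}) {
        $inverted{$word}{$id} = 1;
    }
}

# A multi-word AND search is then an intersection of the id sets:
# a document matches only if every keyword's set contains its id.
sub search {
    my @words = map { lc } @_;
    my %hits;
    for my $id (keys %{ $inverted{ $words[0] } || {} }) {
        $hits{$id} = 1 if @words == grep { $inverted{$_}{$id} } @words;
    }
    return sort { $a <=> $b } keys %hits;
}

print join(',', search('perl', 'search')), "\n";  # only doc 1 has both words
```

        Note that the cost of the intersection grows with the size of the smallest posting list, not with the size of the whole index, which is why this structure scales better than re-scanning a flat file per keyword.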
(ichimunki) Re: All-Perl search engine having speed issues
by ichimunki (Priest) on Nov 19, 2001 at 23:42 UTC
    First, some suggestions unrelated to the benchmark issue:

    Use strict (in any code over one line long, it will save you countless hours of hunting little bugs. It will also enforce scoping more carefully, and it looks like you're having some "fun" with that piece too: you shouldn't have to undef a list that is about to go out of scope.)

    Use CGI.pm (it may be a little heavy, but it provides good form handling)

    Use taint mode (you never know what's going to sneak in on a POST).

    Your loops will be more readable if you put an obvious list into the loop definition rather than using the C-like syntax. We can all figure it out, but that takes brain time on the reader's end, where it is least desirable.
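    For example (a made-up keyword list, just to show the two styles side by side):

```perl
use strict;
use warnings;

my @keywords = qw(perl search engine);

# C-style: works, but the reader has to decode the index bookkeeping.
my @c_style;
for (my $i = 0; $i < @keywords; $i++) {
    push @c_style, $keywords[$i];
}

# List-style: the intent is obvious at a glance.
my @list_style;
for my $word (@keywords) {
    push @list_style, $word;
}

print "@c_style\n@list_style\n";  # both visit the same elements
```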

    Have you tested your split against strings containing more than one \W character in a row? Two spaces between keywords will slow your search down for no apparent reason (and may even cause other problems).
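    To see the failure mode, compare splitting on a single non-word character against splitting on a run of them (the query string here is invented):

```perl
use strict;
use warnings;

my $query = 'foo  bar';   # two spaces between keywords

my @naive  = split /\W/,  $query;   # ('foo', '', 'bar') -- empty field!
my @robust = split /\W+/, $query;   # ('foo', 'bar')

print scalar(@naive), " vs ", scalar(@robust), "\n";
```

    That empty string in the naive version becomes a "keyword" that the engine then dutifully tries to look up.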

    Finally, with respect to your algorithm, you can't get away from the performance hit with the way you have this built. sysseek may be very efficient, but doing it twice takes twice as long (depending on where in the file the words fall). It looks like you're already using your filesystem to best advantage by sorting/indexing the search files into initial-letter groupings, etc., but again: if you do something twice, it takes twice as long.
      The old "use strict, CGI.pm and taint" reply... you forgot "warnings", ichimunki :)

      Tiago
        Someone had to say it! ;)

        At least I commented on something other than just those old saws. And I did remember -w, but only a while after I made the post... thanks for reminding us!
Re: All-Perl search engine having speed issues
by cLive ;-) (Prior) on Nov 19, 2001 at 22:38 UTC
    "With a one-word search i benchmarked to about 7520 queries per second."

    "...the speed seems to almost fall in half or at least by a third with each added keyword to search for."

    "A 6 word search gives about 3420 queries per second."

    The mathematician within me is itching to ask...

    "How does 7520 *((2/3)^6) come anywhere near 3420? Wouldn't that be nearer 660?"

    just my .09 .05 .02

    cLive ;-)

      Can't feel too bad about nitpicking a nit...

      "each added keyword" would imply (2/3)^5 wouldn't it? ;-)

      -Blake
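
      The arithmetic above is easy to check: going from a 1-word to a 6-word search is five added keywords, so Blake's exponent of 5 is the right one, and neither figure comes anywhere near 3420.

```perl
use strict;
use warnings;

my $base = 7520;   # the 1-word benchmark from the original post

my $six  = $base * (2/3)**6;   # cLive's version: ~660 queries/sec
my $five = $base * (2/3)**5;   # Blake's off-by-one fix: ~990 queries/sec

printf "(2/3)^6: %.0f\n(2/3)^5: %.0f\n", $six, $five;
```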

      OK, so I can't count, but those are my benchmark results (the queries I showed).