elbow has asked for the wisdom of the Perl Monks concerning the following question:

I'm building a search engine to use against my company's files. My first version was a straightforward use of Index Server with modifications to the ASP script that came with it. My second version was a Perl CGI script that called the ASP script to run Index Server and pass the details back to the Perl script. From there I could use Perl to manipulate the files as I wished (i.e. highlighting search words).

Unfortunately (but perhaps unsurprisingly) Index Server does not fully cover the search criteria I wish to use. The use of wildcard searches is required, in any format (*stuff, st*ff, and stuff*). I've written a 'brute force' search program that matches on regular expressions, but with over 6000 documents to search it takes a while.
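
For illustration, a pattern like st*ff can be compiled into a regex with a small helper (a sketch only; the helper name is hypothetical and it assumes '*' is the only wildcard metacharacter):

    # Convert a user wildcard pattern such as 'st*ff' into a compiled regex.
    sub wildcard_to_regex {
        my $pattern = shift;
        # Escape everything except '*', which becomes "zero or more word chars".
        my $re = join '\w*', map { quotemeta } split /\*/, $pattern, -1;
        return qr/\b$re\b/i;
    }

    my $re = wildcard_to_regex('st*ff');    # qr/\bst\w*ff\b/i
    print "match\n" if 'stuff' =~ $re;      # also matches 'staff', etc.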

My question is this: does anybody have any ideas on how to improve the speed? My code for the search is:

# Take each word from the array of search words and apply the qr//
# operator to prevent a recompile for each regex.
# Output the words to a new array using map (faster than foreach).
@search_array = map { qr/\b$_\b/i } @and_array;

# For each file within the directory
while ( defined( $File = readdir(DIR) ) ) {

    # Skip the directory and parent directory (read as . and ..)
    next if $File =~ /^\.\.?$/;

    # Open the file for regular expression match
    open ( FILE, "$File" ) || die "Cannot open file $File: $!";

    # Read file into an array
    @file_contents = ();
    @file_contents = <FILE>;
    close FILE;

    $nomatch = 0;

    # Loop to match each search word held in the array against each
    # line in the file.
    WORD: for $word ( @search_array ) {
        @found = ();
        @found = grep /$word/i, @file_contents;
        if ( scalar @found == 0 ) {
            $nomatch = 1;
            last WORD;
        }
    }

    if ( $nomatch == 0 ) {
        $x = $x + 1;
        print LIST "$x $File " . $TelonDir . $in{'scope'} . "\\$File \n";
    }

}   # End of while defined $File

closedir DIR;
close LIST;
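
One incremental speedup, before reaching for a full index (a hedged sketch reusing the variables above, with a lexical filehandle instead of FILE): slurp each file into a single string, so a failing word abandons the file after one scan instead of grepping every line for every word.

    # Read the whole file into one scalar.
    my $contents = do {
        local $/;                           # slurp mode, scoped to this block
        open my $fh, '<', $File or die "Cannot open file $File: $!";
        <$fh>;
    };

    # Test each compiled pattern against the whole string; the first
    # word that fails to match rules the file out immediately.
    my $matched_all = 1;
    for my $word (@search_array) {
        if ( $contents !~ $word ) {
            $matched_all = 0;
            last;
        }
    }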

Please go easy on me - I've learnt Perl from scratch by myself over the last 8 months and I'm certainly not the most adept!

elbow


Re: Speed up the search
by derby (Abbot) on May 07, 2002 at 13:10 UTC
    Wow. You could write a book (and some have) on search technology. IMHO, there are no good Perl search modules (please, someone prove me wrong). Check out searchtools for a pretty comprehensive list of available apps/libraries. A lot of the products there cost ($$) and a lot of them focus on spidering information rather than indexing/searching, but you should find it a good starting point. Just be prepared to spend a significant chunk of time integrating.

    The main problem with your approach is that it will not scale well. It may work fine for your current doc set, but add a few thousand more documents and it will become unbearably slow. Doing all that regex work in real time will also become burdensome. Most people approach this problem by indexing offline and then using those indexes for searching. You run the risk of stale results if your docs are extremely dynamic, but most aren't, so indexing on a periodic basis (weekly, say) will do the trick.
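
    A minimal sketch of the offline-index idea in plain Perl, assuming @documents holds the file paths and using the core Storable module for persistence (the hash layout is illustrative, not any particular module's format):

        use Storable qw(store retrieve);

        # Offline pass: build word -> { file => count } postings and save them.
        my %index;
        for my $file (@documents) {
            open my $fh, '<', $file or next;
            while (<$fh>) {
                $index{ lc $1 }{ $file }++ while /(\w+)/g;
            }
            close $fh;
        }
        store \%index, 'search.idx';

        # Query time: load once, then AND the posting lists together.
        my $idx = retrieve('search.idx');
        my %seen;
        for my $word (@query_words) {
            $seen{$_}++ for keys %{ $idx->{ lc $word } || {} };
        }
        my @hits = grep { $seen{$_} == @query_words } keys %seen;

    Wildcard terms then only have to scan the (much smaller) list of index keys, e.g. grep { /^st\w*ff$/ } keys %$idx, rather than every document.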

    An example of a Perl library listed at searchtools is perlfect.

    -derby

    update: Thanks perrin. I'll look into Search::InvertedIndex. I've looked at DBIx::FullTextSearch before but didn't want the MySQL overhead.

    update again: Just to clarify, I would really like to see something like Lucene in the Perl world.

    update yet again: perrin is right. I need to look at CPAN more closely. Besides the two mentioned below, WAIT is a Perl/XS implementation of the once-ubiquitous WAIS.

      We have just written an indexing search tool. Contact r.talbot@staff.covcollege.ac.uk for more info.
Re: Speed up the search
by Stegalex (Chaplain) on May 07, 2002 at 13:10 UTC
    We use ht://Dig at my company. It's open source and it works well. I don't see a reason to roll your own search engine, but best of luck if that's what you want to do.

    ~~~~~~~~~~~~~~~
    I like chicken.
Re: Speed up the search
by tachyon (Chancellor) on May 07, 2002 at 15:39 UTC

    The essence of efficient searching is to index your documents once (slow) and then search the index to find them (fast). You need to update your index, of course, to keep pace with changes to your docs. No matter how you cut it, searching a large amount of data takes time. You save real time by avoiding unnecessary repetition (i.e. only re-index updated docs). You save user time by doing the drudge work in advance. The trick is to kid the user that it is fast by doing the work when they are not looking!
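
    A hedged sketch of the "only re-index updated docs" part, assuming $doc_root is the document directory, an mtime cache persisted with core Storable, and a hypothetical reindex_document() hook into whatever indexer is used:

        use File::Find;
        use Storable qw(store retrieve);

        # Load the modification times recorded on the last run, if any.
        my %mtimes = -e 'mtimes.idx' ? %{ retrieve('mtimes.idx') } : ();

        find( sub {
            return unless -f $_;
            my $mtime = ( stat _ )[9];          # reuse the stat from -f
            my $path  = $File::Find::name;
            # Re-index only files that are new or changed since last run.
            if ( !exists $mtimes{$path} or $mtimes{$path} != $mtime ) {
                reindex_document($path);        # hypothetical indexer hook
                $mtimes{$path} = $mtime;
            }
        }, $doc_root );

        store \%mtimes, 'mtimes.idx';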

    See the links already mentioned for more details. A relational database index is likely to scale better and work more reliably than a flat-file index.
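
    For the relational route, a sketch using DBI with DBD::SQLite (the table layout is illustrative): keep the postings in a table with an index on the word column, and SQL LIKE even covers the wildcard forms, with '%' playing the role of '*'.

        use DBI;

        my $dbh = DBI->connect( 'dbi:SQLite:dbname=search.db', '', '',
                                { RaiseError => 1 } );
        $dbh->do('CREATE TABLE postings (word TEXT, file TEXT, hits INTEGER)');
        $dbh->do('CREATE INDEX word_idx ON postings (word)');

        # Populate from an offline pass, then query. 'st%ff' is the SQL
        # spelling of st*ff; note a leading '%' cannot use the index.
        my $files = $dbh->selectcol_arrayref(
            'SELECT DISTINCT file FROM postings WHERE word LIKE ?',
            undef, 'st%ff' );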

    cheers

    tachyon

    s&&rsenoyhcatreve&&&s&n.+t&"$'$`$\"$\&"&ee&&y&srve&&d&&print

Re: Speed up the search
by BUU (Prior) on May 07, 2002 at 12:40 UTC
    a) <code> tags are your friends.
    b) if you aren't hell-bent on reinventing the wheel, you might want to look at Google's search technology, which you can license out or some such.
      a) I put them in - honest!!
      b) Will have a look but not sure the company are interested in paying anything out! Thanks though.
        To close the <code> section, use the </code> tag.