Hero Zzyzzx has asked for the wisdom of the Perl Monks concerning the following question:

I believe I have aged a few years trying to figure out the best way to do this.

For various reasons, I'm reduced to writing my own site search engine.

I've already made a stab at an indexer, and I'm happy with the results.

The thing that's bugging me is how to create a search query parser that allows for parenthesis, "and", "or", and "not" matching. After reading the Cookbook and this article on perl.com, I have some ideas but I'm losing it on the implementation.

Here's what I want to do:

  1. Step through the search query by word, from left to right. I've got code that does this nicely.
  2. Look the word up in the search index, retrieve the doc ids it's in into an array.
  3. If a word begins with "(" send everything up to the next ")" to a second level match, that returns an array to be added (in the case of an "or") or subtracted (in the case of an "and") to the match array
  4. Repeat until done.

The thing that's mostly eluding me is step 3- how to send everything from "(" to ")" somewhere.

The code I'm using to walk through the string word-by-word is:

foreach($query =~ /\b([A-Za-z'\d.@\/:-]+)\b/g){ #do stuff here }

And this works pretty well. Any help is appreciated, especially if you know some CPAN modules that would simplify this. I've played with Text::Balanced and it's partially what I need.

-Any sufficiently advanced technology is
indistinguishable from doubletalk.

Replies are listed 'Best First'.
Re: Search engine query parsing
by perrin (Chancellor) on Aug 16, 2001 at 02:38 UTC

      You know, I've just started to give DBIx::FullTextSearch a real chance, and I'm playing around with it. Part of the issue is that I'm indexing legal documents, and the lawyers want a high level of search functionality- including boolean operators, searches by date, category, author, etc. and I'm just afraid that DBIx::FullTextSearch isn't going to cut it.

      -Any sufficiently advanced technology is
      indistinguishable from doubletalk.

        Hmmm... if you need field-based searching, you could combine a text search like this with some simple field-based searching, maybe in MySQL or a DBM file (for indexing keys to documents). Or maybe you could do multiple searches for the nested conditions and combine the answers. DBIx::FullTextSearch does support a pretty fancy syntax, although not quite as fancy as what you're talking about.

        If you need to write your own parser, try Parse::RecDescent or Parse::Tokens.