Search term parsing

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

I'm building a search engine that interfaces with my content management system. I have two MLDBM indexes: a basic weighted term index and a semi-hierarchical phrase index.

Both are being generated fine, and I'm now building the search interface (which will be visible via the web). I wrote the term searcher first and it runs fine, but now I'm faced with shoe-horning the phrase index in.

A standard search query would be: term1 term2 "this is a phrase" term3 "oops

Here "oops is an unbalanced phrase, so we treat it as a term and strip the " with a regex.

As far as I can see, this means the user gets a single search field on the web form. I've been playing with a parser for some time, and I stopped staring blankly out of the window when I came across this snippet from Merlyn (et al).

while (m/\G(<[^>]*>|#[^#]*#|[^#<]+)/gc) 
{
    push @pieces, $1;
}

I've converted this to the following script (replacing < with a "):

use strict;
use warnings 'all';

# The phrase we are testing
$_ = '"this is a     phrase" +one -two .three, "another" phrase "unbalalanced bit, ""remove';
print $_,"\n";

my (@phrases,@terms);

# Grab the chunks and stick into our arrays
while(m/\G("[^"]*"|[^"]+|"[^"]*)/gc)
{
    my $p = $1;
    next unless defined($p);
    if($p =~ m/"$/)
    {
        push @phrases,cleanup('phrase',$p);
    }
    else
    {
        push @terms,cleanup('term',$p);
    }
}

# Display the phrases and terms
foreach (@phrases)
{
    print 'phrase: ',$_,"\n";
}

foreach (@terms)
{
    print 'term: \'',$_,"'\n";
}

#
# Sub cleanup
#
# Removes quotes and multiple spaces. In the case of a
# term it also removes all punctuation (other than a + or a -)
# and splits on spaces.
#

sub cleanup
{
    my $context = shift;
    return unless defined $context;
    
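    if($context eq 'phrase')
    {
        $_[0] =~ s/"+//g;
        $_[0] =~ s/\s+/ /g;
        $_[0] =~ s/^ | $//g;                  # trim the collapsed edges
        return length($_[0]) ? $_[0] : ();    # drop empty phrases such as ""
    }
    else
    {
        $_[0] =~ s/"//g;
        $_[0] =~ s/[^\w+-]+/ /g;              # \w already covers digits
        $_[0] =~ s/^\s+//;                    # avoids a leading empty field from split
        return split(/\s+/,$_[0]);
    }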
}
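For readers following the regex in the loop above, here is the same alternation written with /x so each branch can carry a comment. This is only a sketch: tokenize() is a hypothetical helper name, and it returns the raw chunks without any of the cleanup the script performs.

```perl
use strict;
use warnings;

# Split a query into raw chunks. tokenize() is a hypothetical name,
# not part of the original script.
sub tokenize {
    my ($query) = @_;
    my @chunks;
    # \G anchors each match where the previous one ended; the c flag
    # keeps pos() in place when the final match attempt fails.
    while ($query =~ m/\G(
            "[^"]*"     # a balanced phrase: quote, non-quotes, quote
          | [^"]+       # a run of unquoted text holding one or more terms
          | "[^"]*      # an opening quote that is never closed
        )/gcx) {
        push @chunks, $1;
    }
    return @chunks;
}

print "[$_]\n" for tokenize('term1 "this is a phrase" term3 "oops');
```

The order of the branches matters: the balanced-phrase branch is tried first, so the unbalanced branch can only match when no closing quote exists further along the string.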

The cleanup function also removes all non-word characters and punctuation apart from + and -. These mark terms to keep in or remove from the search (- being equivalent to 'not').
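One way the signed terms might then be separated is sketched below. It assumes + marks a term that must be present and - one that must be absent; classify_terms() and the bucket names are hypothetical and do not appear in the script above.

```perl
use strict;
use warnings;

# Sort cleaned-up terms into three buckets by their leading sign.
# classify_terms() and the bucket names are illustrative assumptions.
sub classify_terms {
    my @terms = @_;
    my %bucket = (required => [], excluded => [], optional => []);
    for my $term (@terms) {
        if    ($term =~ /^\+(\w.*)/) { push @{ $bucket{required} }, $1 }
        elsif ($term =~ /^-(\w.*)/)  { push @{ $bucket{excluded} }, $1 }
        elsif ($term =~ /^\w/)       { push @{ $bucket{optional} }, $term }
        # anything else, such as a bare + or -, is dropped
    }
    return \%bucket;
}

my $buckets = classify_terms(qw(+one -two three -));
print "$_: @{ $buckets->{$_} }\n" for qw(required excluded optional);
```

A hyphen inside a word (for example well-known) is left alone here because only the first character is inspected.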

I'm only just beginning to understand how the regex works (I built it on a guess and ran with it), so I wonder whether I'm parsing terms the correct way. Is there also a faster way of doing this? Speed matters because I run a stemmer to match these terms against the databases. For those who don't know what that means: I reverse the term, compare it against a list of known endings, remove or preserve the ending, and then move on to the next term.

There's quite a lot going on here, so the faster the better :). If it is useful, I'll post the stemmer code to a separate node (perhaps a craft node?) so people can see what I'm doing.
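To make the stemming step concrete, here is a minimal sketch of the reverse-and-compare idea. It is not the actual stemmer: the ending list, the three-character minimum stem, and the name strip_ending() are all assumptions made for illustration.

```perl
use strict;
use warnings;

# Hypothetical list of endings, stored reversed so each one can be
# compared against the start of the reversed term.
my @reversed_endings = map { scalar reverse $_ } qw(ing ed es s);

# strip_ending() is an illustrative name. It returns the term with the
# first matching ending removed, or the term unchanged.
sub strip_ending {
    my ($term) = @_;
    my $rev = reverse $term;
    for my $ending (@reversed_endings) {
        my $len = length $ending;
        # Keep at least three characters of stem (an arbitrary choice).
        next if length($rev) - $len < 3;
        if (substr($rev, 0, $len) eq $ending) {
            return scalar reverse substr($rev, $len);
        }
    }
    return $term;
}

print strip_ending($_), "\n" for qw(parsing indexes terms bed);
```

Longer endings are listed first so that es is tried before s; a real stemmer would need a much larger list and rules for when an ending is preserved.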

Edit Masem 2002-02-19 - Added READMORE tag

Re: Search term parsing by merlyn (Sage) on Feb 11, 2002 at 18:08 UTC
Why not just use Text::Query? Unless you feel like reinventing a lot of existing code. -- Randal L. Schwartz, Perl hacker