I'm building a search engine that interfaces with my content management system. I have two MLDBM indexes: one is a basic weighted term index and the other is a semi-hierarchical phrase index.
Both are being generated OK, and I'm now building the search interface (which will be visible via the web). I wrote the term searcher first and it runs fine, but now I am faced with shoehorning the phrase indexer in.
A standard search query would be:
term1 term2 "this is a phrase" term3 "oops
Here "oops is an unbalanced phrase, so we treat it as a term and strip the " with a regex.
This means that, as far as I can see, the user has a single search term field on the web form. I've been playing with a parser for some time, and I stopped staring blankly out of the window when I came across this snippet from Merlyn (et al):
while (m/\G(<[^>]*>|#[^#]*#|[^#<]+)/gc)
{
    push @pieces, $1;
}
I've converted this to the following script (replacing < with a "):
use strict;
use warnings 'all';
# The phrase we are testing
$_ = '"this is a phrase" +one -two .three, "another" phrase "unbalalanced bit, ""remove';
print $_,"\n";
my (@phrases,@terms);
# Grab the chunks and stick into our arrays
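while (m/\G("[^"]*"|[^"]+|"[^"]*)/gc)
{
    my $p = $1;
    # Only a non-empty, balanced "..." chunk is a phrase; an empty ""
    # or an unbalanced quote falls through to the term branch
    if ($p =~ m/^"[^"]+"$/)
    {
        push @phrases, cleanup('phrase', $p);
    }
    else
    {
        push @terms, cleanup('term', $p);
    }
}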
# Display the phrases and terms
foreach (@phrases)
{
    print 'phrase: ', $_, "\n";
}
foreach (@terms)
{
    print 'term: \'', $_, "'\n";
}
#
# Sub cleanup
#
# Removes quotes and multiple spaces. In the case of a
# term it also removes all punctuation (other than a + or a -)
# and splits on spaces.
#
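sub cleanup
{
    # Work on a copy so the caller's variable is left untouched
    my ($context, $text) = @_;
    return unless defined $context && defined $text;
    if ($context eq 'phrase')
    {
        $text =~ s/"+//g;
        $text =~ s/\s+/ /g;
        return $text;
    }
    else
    {
        $text =~ s/"//g;
        # \w already covers digits, and + and - need no escaping here
        $text =~ s/[^\w+-]+/ /g;
        $text =~ s/^\s+//;
        return split(/\s+/, $text);
    }
}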
Additionally, the cleanup function removes all non-word characters and punctuation apart from + and -. These are used to keep or remove terms from our search (- being equivalent to 'not').
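To make the + and - handling concrete, here is a rough sketch of how the cleaned terms might be sorted into buckets before they hit the index. The name classify_terms, the bucket names, and the rule that only the first leading operator counts are all assumptions made for illustration; none of this is in the script above.

```perl
use strict;
use warnings;

# Hypothetical helper: sort cleaned terms into required (+),
# excluded (-) and plain buckets.
sub classify_terms
{
    my @terms = @_;
    my %bucket = (required => [], excluded => [], plain => []);
    foreach my $term (@terms)
    {
        # Assumption: only the first leading operator counts; any
        # further leading + or - characters are skipped
        my ($op, $word) = $term =~ /^([+-]?)[+-]*(.*)$/s;
        # A bare operator with no word after it is ignored
        next unless length $word;
        my $key = $op eq '+' ? 'required'
                : $op eq '-' ? 'excluded'
                :              'plain';
        push @{ $bucket{$key} }, $word;
    }
    return \%bucket;
}

my $buckets = classify_terms(qw(+one -two three));
foreach my $key (sort keys %$buckets)
{
    print $key, ': ', join(' ', @{ $buckets->{$key} }), "\n";
}
```

An excluded term would then be used to drop matching documents from the result set, while a required term would have to appear in every hit.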
I'm only just beginning to understand how the regex works (I built it on a guess and ran with it), so I wonder whether I'm doing my term parsing the correct way. Also, is there a faster way of doing this? Speed is quite important, as I am running a stemmer to match these terms against the databases. For those who don't know what that means: I reverse the term, compare it against a list of known endings, remove the ending or preserve it, and then move on to the next term.
There's quite a lot going on here, so the faster the better :). If it is useful, I'll post the stemmer code to a separate node (perhaps a craft node?) so people can see what I'm doing.
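For anyone unfamiliar with the idea, the reverse-and-compare step can be sketched roughly as below. This is not the actual stemmer: the ending list, the minimum stem length of three characters, and the name stem_term are all made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical ending list; a real stemmer would use a much larger,
# language-specific table.
my @endings = qw(ing ed es s);

# Store the endings reversed, longest first, so the longest ending wins
my @reversed_endings =
    sort { length($b) <=> length($a) }
    map  { scalar reverse $_ } @endings;

sub stem_term
{
    my ($term) = @_;
    my $rev = scalar reverse lc $term;
    foreach my $ending (@reversed_endings)
    {
        # An ending of the word is a prefix of the reversed word
        next unless index($rev, $ending) == 0;
        # Assumption: preserve the ending if the stem would be
        # shorter than three characters
        last if length($rev) - length($ending) < 3;
        return scalar reverse substr($rev, length($ending));
    }
    return lc $term;
}

print stem_term($_), "\n" foreach qw(indexes searched sing);
```

Reversing the term turns the suffix test into a prefix test, which index() can answer without a regex.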
Edit Masem 2002-02-19 - Added READMORE tag