
I'm building a search engine that interfaces with my content management system. I have two MLDBM indexes: a basic weighted term index and a semi-hierarchical phrase index.

The indexes are being generated OK, and I'm now building the search interface (which will be visible via the web). I wrote the term searcher first and it runs fine, but now I'm faced with shoe-horning the phrase index in.

A standard search query would be: term1 term2 "this is a phrase" term3 "oops

Here "oops is an unbalanced phrase, so we treat it as a term and strip the " with a regex.

As far as I can see, this means the user gets a single search field on the web form. I'd been playing with a parser for some time, and I stopped staring blankly out of the window when I came across this snippet from Merlyn (et al).

while (m/\G(<[^>]*>|#[^#]*#|[^#<]+)/gc) 
{
    push @pieces, $1;
}
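For anyone following along, this is how I read that loop, as a minimal sketch: \G anchors each attempt at the point where the previous match stopped, /g iterates, and /c keeps pos() from being reset when a match finally fails. The sample string and the @pieces dump are made up for illustration.

```perl
use strict;
use warnings;

# Made-up sample input: <...> and #...# are the delimited chunks.
my $text = 'plain <tag> more #note# end';
my @pieces;

for ($text) {
    # Each pass picks up exactly where the last one ended (\G) and grabs
    # either a whole <...> chunk, a whole #...# chunk, or a run of plain text.
    while (m/\G(<[^>]*>|#[^#]*#|[^#<]+)/gc) {
        push @pieces, $1;
    }
}

print "[$_]\n" for @pieces;
# [plain ]
# [<tag>]
# [ more ]
# [#note#]
# [ end]
```

Note that the plain-text runs keep their surrounding whitespace, which is why a cleanup step is still needed afterwards.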

I've converted this into the following script (replacing the < and > delimiters with "):

use strict;
use warnings 'all';

# The phrase we are testing
$_ = '"this is a     phrase" +one -two .three, "another" phrase "unbalalanced bit, ""remove';
print $_,"\n";

my (@phrases,@terms);

# Grab the chunks and stick into our arrays
while(m/\G("[^"]*"|[^"]+|"[^"]*)/gc)
{
    my $p = $1;
    next unless defined($p);
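    # A phrase is a balanced, non-empty "..." chunk; an empty "" falls
    # through to the term branch, where cleanup() reduces it to nothing
    if($p =~ m/^"[^"]+"$/)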
    {
        push @phrases,cleanup('phrase',$p);
    }
    else
    {
        push @terms,cleanup('term',$p);
    }
}

# Display the phrases and terms
foreach (@phrases)
{
    print 'phrase: ',$_,"\n";
}

foreach (@terms)
{
    print 'term: \'',$_,"'\n";
}

#
# Sub cleanup
#
# Removes quotes and multiple spaces. In the case of a
# term it also removes all punctuation (other than a + or a -)
# and splits on spaces.
#

sub cleanup
{
    my $context = shift;
    return unless defined $context;
    
    if($context eq 'phrase')
    {
        $_[0] =~ s/"+//g;
        $_[0] =~ s/\s+/ /g;
        return $_[0];
    }
    else
    {
        $_[0] =~ s/"//g;
        $_[0] =~ s/[^\w+-]+/ /g;   # \w already covers digits
        $_[0] =~ s/^\s+//;
        return split(/\s+/,$_[0]);
    }
}

Additionally, the cleanup function removes all non-word characters and punctuation apart from + and -. These are used to keep or remove terms from the search (- being equivalent to 'not').
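To make the + and - handling concrete, here is a minimal sketch of how the cleaned terms could be sorted into buckets before they hit the index. The classify_terms name and the required/excluded/optional bucket names are made up for illustration and are not part of the script above.

```perl
use strict;
use warnings;

# Hypothetical helper: sort already-cleaned terms into buckets by prefix.
sub classify_terms {
    my @terms = @_;
    my %bucket = (required => [], excluded => [], optional => []);
    for my $term (@terms) {
        if    ($term =~ /^\+(\w+)$/) { push @{ $bucket{required} }, $1 }
        elsif ($term =~ /^-(\w+)$/)  { push @{ $bucket{excluded} }, $1 }
        elsif ($term =~ /^\w+$/)     { push @{ $bucket{optional} }, $term }
        # Anything else (a bare "+", or "a-b") is dropped in this sketch.
    }
    return \%bucket;
}

my $buckets = classify_terms(qw(+one -two three +));
print "required: @{ $buckets->{required} }\n";   # required: one
print "excluded: @{ $buckets->{excluded} }\n";   # excluded: two
print "optional: @{ $buckets->{optional} }\n";   # optional: three
```

Whether a + or - in the middle of a term should be kept, split, or dropped is a design choice this sketch sidesteps by discarding such terms.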

I'm only just beginning to understand how the regex works (I built it on a guess and ran with it), so I wonder whether I'm parsing the terms the correct way. Is there also a faster way of doing this? Speed matters because I run a stemmer to match these terms against the databases. For those who don't know what that means: I reverse the term, compare it against a list of known endings, remove or preserve the ending, and then move on to the next term.
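In case the description of the stemmer is too terse, this is roughly the idea, as a cut-down sketch rather than my real code. The ending list, the stem name, and the three-character minimum are all placeholders chosen for the example.

```perl
use strict;
use warnings;

# Placeholder ending list, longest first so "ing" is tried before "s".
my @endings = qw(ing ed es s);

# Store each ending reversed so it can be matched at the front of a reversed term.
my @reversed_endings = map { scalar reverse $_ } @endings;

sub stem {
    my ($term) = @_;
    my $rev = reverse lc $term;
    for my $ending (@reversed_endings) {
        # Strip the ending only if at least three characters of stem remain
        # (an arbitrary threshold for this sketch).
        if (index($rev, $ending) == 0 && length($rev) - length($ending) >= 3) {
            return scalar reverse substr($rev, length $ending);
        }
    }
    return lc $term;   # no usable ending: preserve the term as it is
}

print stem($_), "\n" for qw(searching terms indexed is);
# search
# term
# index
# is
```

Because every term runs through a loop like this, the per-term cost is what makes the parsing speed matter.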

There's quite a lot going on here, so the faster the better :). If it's useful, I'll post the stemmer code in a separate node (perhaps a Craft node?) so people can see what I'm doing.

Edit Masem 2002-02-19 - Added READMORE tag

In reply to Search term parsing by Anonymous Monk
