I'm building a search engine that interfaces with my content management system. I have two MLDBM indexes: one is a basic weighted term index and the other is a semi-hierarchical phrase index.

These are being generated OK, but I'm now building my search interface (which will be visible via the web). I wrote the term searcher first and that is running OK, but now I am faced with shoe-horning the phrase indexer in.


A standard search query would be: term1 term2 "this is a phrase" term3 "oops

Here "oops is an unbalanced phrase, so we treat it as a term and remove the " with a regex.

This means that, as far as I can see, the user has one search term field on the web form. I've been playing with a parser for some time, and I stopped staring blankly out of the window when I came across this snippet from Merlyn (et al):
while (m/\G(<[^>]*>|#[^#]*#|[^#<]+)/gc) { push @pieces, $1; }

I've converted this to the following script (replacing < with a "):
use strict;
use warnings 'all';

# The phrase we are testing
$_ = '"this is a phrase" +one -two .three, "another" phrase "unbal +alanced bit, ""remove';
print $_,"\n";

my (@phrases,@terms);

# Grab the chunks and stick into our arrays
while(m/\G("[^"]*"|[^"]+|"[^"]*)/gc) {
    my $p = $1;
    next unless defined($p);
    if($p =~ m/"$/) {
        push @phrases,cleanup('phrase',$p);
    } else {
        push @terms,cleanup('term',$p);
    }
}

# Display the phrases and terms
foreach (@phrases) { print 'phrase: ',$_,"\n"; }
foreach (@terms) { print 'term: \'',$_,"'\n"; }

#
# Sub cleanup
#
# Removes quotes and multiple spaces. In the case of a
# term it also removes all punctuation (other than a + or a -)
# and splits on spaces.
#
sub cleanup {
    my $context = shift;
    return unless defined $context;
    if($context eq 'phrase') {
        $_[0] =~ s/"+//g;
        $_[0] =~ s/\s+/ /g;
        return $_[0];
    } else {
        $_[0] =~ s/"//g;
        $_[0] =~ s/[^\w\d\+\-]+/ /g;
        $_[0] =~ s/^\s+//g;
        return split(/\s+/,$_[0]);
    }
}

Additionally, the cleanup function removes all non-word characters and punctuation apart from a + or a -. These prefixes are used to require or exclude terms in our search (- being equivalent to 'not').
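To make the +/- handling concrete, here is a minimal sketch of how the cleaned terms might be sorted into buckets before they hit the index. The sub name classify_terms and the bucket names are hypothetical (they are not part of the script above), and skipping malformed terms such as a bare + is just one possible choice.

```perl
use strict;
use warnings;

# Hypothetical helper: sort cleaned terms into three buckets by prefix.
#   +term  => required (must be present)
#   -term  => excluded (equivalent to 'not')
#   term   => optional
sub classify_terms {
    my @terms = @_;
    my (@required, @excluded, @optional);
    for my $term (@terms) {
        if ($term =~ /^\+(\w+)$/) {
            push @required, $1;
        }
        elsif ($term =~ /^-(\w+)$/) {
            push @excluded, $1;
        }
        elsif ($term =~ /^\w+$/) {
            push @optional, $term;
        }
        # Anything else (a bare "+", "a-b", ...) is skipped in this sketch.
    }
    return {
        required => \@required,
        excluded => \@excluded,
        optional => \@optional,
    };
}

my $buckets = classify_terms(qw(+one -two three));
print "required: @{ $buckets->{required} }\n";
print "excluded: @{ $buckets->{excluded} }\n";
print "optional: @{ $buckets->{optional} }\n";
```

Returning a hash reference keeps the three lists together, so the searcher can check the excluded list first and bail out early on a document that contains a 'not' term.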

I'm only just beginning to understand how the regex works (I built it on a guess and ran with it), but I wonder whether I'm doing my term parsing the correct way. Also, is there a faster way of doing this? This is quite important, as I am running a stemmer to match these terms against the databases. For those who don't know what that means: I reverse the term, compare it against a list of known endings, remove the ending or preserve it, and then move on to the next term.

There's quite a lot going on here, so the faster the better :). If it is useful, I'll post the stemmer code to a separate node (perhaps a craft node?) so people can see what I'm doing.
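For anyone who wants a picture of the reverse-and-compare idea, here is a heavily simplified sketch. It is not my actual stemmer: the endings list, the minimum stem length and the sub name stem are all made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical list of endings; a real stemmer would use a much larger table.
my @ENDINGS = qw(ing ed es s);

# Store each ending reversed, longest first, so the longest match wins.
my @REVERSED = map { scalar reverse($_) }
               sort { length($b) <=> length($a) } @ENDINGS;

# Assumed rule: keep the ending if removing it would leave fewer than 3 letters.
my $MIN_STEM = 3;

sub stem {
    my ($term) = @_;
    my $rev = scalar reverse(lc $term);
    for my $ending (@REVERSED) {
        # A reversed ending at the start of the reversed term
        # is a suffix of the original term.
        next unless index($rev, $ending) == 0;
        my $rest = substr($rev, length $ending);
        next if length($rest) < $MIN_STEM;   # preserve the ending
        return scalar reverse($rest);
    }
    return lc $term;                          # no known ending found
}

print join(' ', map { stem($_) } qw(boxes Indexed bus)), "\n";
```

Reversing once per term means every ending check is a cheap prefix comparison with index, with no regex compiled per ending.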

Edit Masem 2002-02-19 - Added READMORE tag


In reply to Search term parsing by Anonymous Monk
