moseley has asked for the wisdom of the Perl Monks concerning the following question:

Ok, this is somewhat of a repost. And it is not a candidate for an easy answer. Maybe there's no answer, other than to optimize in C (or Inline the code). But perhaps someone with good perl internals skills can help.

Frankly, I'm sure it's too much to ask from PerlMonks, too, but you never know. If this does look interesting to someone, I'll be happy to put together some code that can be easily run as-is.

I'm a developer for the Swish-e search engine (http://swish-e.org). Swish is open source, and is always looking for help with development.

Swish includes a CGI/mod_perl script for displaying search results. The script's goal is to display the search terms highlighted just like google (well, better than google).

Here's a current running example:

http://swish-e.org/Discussion/search/swish.cgi

Context search term highlighting is not a trivial problem solved by simple regular expression matching, especially when trying to highlight phrases.

Here's why: swish-e creates a reverse index. It parses documents, extracts out "words" and creates an index. On a search, swish finds the words or phrases you are searching for, and lists the documents where they are found.

The reason "words" is in quotes there is because swish has its own idea about what makes up a word.

So to highlight terms in the source document (that were probably "hit" words as swish found them) the search script must actually (re)parse the document into "words" as swish indexes them, and be able to map those "words" back to the original text.

It's very processing intensive. Look at these numbers from profiling. Swish is returning 15 "hits", but a hit could be in the title or the body, so it's calling the highlighting code twice for each hit (hence 30 calls).

%Time ExclSec CumulS #Calls sec/call Csec/c Name
 66.7   1.270  1.659     30   0.0423 0.0553 PhraseHighlight::highlight_text
 11.5   0.220  0.220     30   0.0073 0.0073 PhraseHighlight::split_by_wordchars
 8.93   0.170  0.170     30   0.0057 0.0057 PhraseHighlight::build_highlighted_text
 2.63   0.050  1.719      1   0.0499 1.7188 SwishQuery::run_swish
 2.10   0.040  0.040     11   0.0036 0.0036 CGI::_compile
 2.10   0.040  0.090      3   0.0133 0.0299 SwishSearch::BEGIN
 0.53   0.010  0.010      1   0.0100 0.0100 Exporter::export
 0.53   0.010  0.010      1   0.0100 0.0100 Config::BEGIN
 0.53   0.010  0.010     95   0.0001 0.0001 SwishQuery::config
 0.53   0.010  0.010      1   0.0100 0.0100 vars::BEGIN
 0.53   0.010  0.010      1   0.0100 0.0100 Fh::BEGIN
 0.53   0.010  0.030      3   0.0033 0.0100 CGI::BEGIN
 0.53   0.010  1.777      1   0.0100 1.7770 SwishSearch::process_request
 0.53   0.010  1.728      1   0.0099 1.7285 SwishQuery::run_query
The actual query takes no time at all. It's the highlighting that kills the speed. highlight_text() is basically the code shown below.

split_by_wordchars() is just a call to split that breaks the input text up by "wordchars" (more below). I doubt that can be optimized -- even in C. The only way would be to avoid splitting up all the source text and instead switch to a stream parser. But that's only 10% of the processing time.
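If that stream parser ever looks worth trying, it can be sketched with a \G-anchored /g match that hands back one token at a time instead of split()ting the whole document first. The character class below is a toy stand-in for swish's real WordCharacters set:

```perl
use strict;
use warnings;

# Hypothetical streaming tokenizer: calls back with ($token, $is_word) pairs
# and stops as soon as the callback returns false, so you can quit early
# once enough context has been highlighted.
sub stream_words {
    my ( $text_ref, $callback ) = @_;
    my $wc = qr/[A-Za-z0-9]/;    # toy stand-in for WordCharacters

    while ( $$text_ref =~ /\G(${wc}+|(?:(?!$wc).)+)/gs ) {
        my $token = $1;
        last unless $callback->( $token, $token =~ /^$wc/ ? 1 : 0 );
    }
}

my $text = "run, runner ran";
my @tokens;
stream_words( \$text, sub { push @tokens, $_[0]; 1 } );
# @tokens holds ("run", ", ", "runner", " ", "ran")
```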

build_highlighted_text() is the process of joining the highlighted words together (with ...'s added), based on what's flagged in the @flags array in the code below.

Now, back to the overview of swish. Swish splits the source text by "wordcharacters" which is the list of allowable chars in a word. Then "IgnoreFirstChars" and "IgnoreLastChars" are removed from each word. This allows a dot inside a word, but not at the end of a word, for example. Then words are checked to see if they are valid (too long, too many digits, and so on). Then they may be converted into another word for fuzzy searching (like soundex/metaphone, or stemmed). Then that resulting word is indexed.
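A rough sketch of that per-word pipeline follows; the character sets and validity limits are made-up illustrations, not swish's real WordCharacters/IgnoreFirstChars/IgnoreLastChars settings:

```perl
use strict;
use warnings;

# Returns the normalized "swish word", or undef if the word wouldn't be indexed.
sub normalize_word {
    my ( $raw ) = @_;

    # Toy stand-ins for IgnoreFirstChars / IgnoreLastChars
    ( my $word = $raw ) =~ s/^[("'.]+//;
    $word =~ s/[)"'.,]+$//;

    # Simplified validity checks (too long, too many digits, ...)
    return undef if length $word > 40;
    return undef if ( $word =~ tr/0-9// ) > 5;

    # A fuzzy transform (stemming, soundex, metaphone, ...) would go here.

    return lc $word;    # swish indexes lowercase
}

my $indexed = normalize_word('(Perl5.6).');
# The inner dot survives; the leading paren and trailing ")." are stripped,
# so $indexed eq 'perl5.6'
```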

When a search is done, the same basic processing is done on the query passed in. Swish makes this "Parsed" query available to the calling program (i.e. the perl search front-end), so you can see how the words were converted into the words that swish is using to search its index.

So, when you do a query to swish, you can get back not only the list of documents it found, but also the "Parsed" query, which is the original query converted into "swish words".

So, to highlight words in the source text (let's ignore the need to parse HTML and assume we are just working with plain text):

1) get the "Parsed" query from swish. Break it into "phrases" from longest phrase to shortest phrase (and a phrase can be one word).

2) split the source text into swish "words" (as swish does during indexing) and non-swish words.

3) Start at the first word in the document.

4) Start with the longest phrase, and look for a match. Note that you must skip any stopwords. Repeat for all "phrases" until a match is found. Bump the position in the source document and repeat until X number of matches have been found.

Ok, here's the code I'm using, and this is where I'm asking for help optimizing. Do note that the "split_by_wordcharacters" function (see profile output above) does take quite a bit of time. I don't expect to get much optimization out of that because it's probably already optimized. But it might help to use a stream split method instead of splitting the entire document.

This bit of code has the text split up like:

@words = split /$wc_regexp/, $$text_ref;
which splits into an array of "swish words" and non-swish-words. That's why all the word indexes are multiplied by 2.
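For that alternating layout to work, the delimiter pattern has to be captured so split keeps the non-word text in the list. A toy illustration (the real $wc_regexp is built from swish's WordCharacters setting):

```perl
use strict;
use warnings;

my $wc_regexp = qr/([^A-Za-z0-9.]+)/;    # captured, so split keeps the separators
my @words = split /$wc_regexp/, "run, runner ran";
# @words is ("run", ", ", "runner", " ", "ran"):
# even indexes ($word_pos * 2) hold swish words,
# odd indexes hold the text between them
```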

One important thing. This is highlighting code *in context*, so, like google, I find a word or phrase to highlight, and display a few words on both sides. The @flags array is used to flag what words need to be displayed in the context output.

WORD:
while ( $word_pos * 2 < @words ) {

    PHRASE:
    foreach my $phrase ( @phrases ) {

        # is phrase longer than what's left?
        next PHRASE if ( $word_pos + @$phrase - 1 ) * 2 > @words;

        my $end_pos = 0;    # end offset of the current phrase

        # now compare all the words in the phrase
        my ( $begin, $word, $end );

        for my $match_word ( @$phrase ) {

            my $cur_word = $words[ ( $word_pos + $end_pos ) * 2 ];

            # split word into ( begin_chars, swish_word, end_chars )
            unless ( $cur_word =~ /$extract_regexp/ ) {
                warn "Failed to parse IgnoreFirst/Last from word '";
                next PHRASE;
            }
            ( $begin, $word, $end ) = ( $1, $2, $3 );

            # swish works only with lowercase words
            my $check_word = lc $word;

            if ( $end_pos && exists $self->{stopwords}{$check_word} ) {
                $end_pos++;
                print STDERR " Found stopword '$check_word' in the middle of phrase - * MATCH *\n";

                # go on to check this match word with the next word.
                redo if ( $word_pos + $end_pos ) * 2 < @words;

                # No more words to match with, so go on to next phrase.
                next PHRASE;
            }

            # We may be using a fuzzy search such as stemming or soundex
            if ( $stemmer_function ) {
                my $w = $stemmer_function->($check_word);
                $check_word = $w if $w;
            }

            # Now we are ready to compare the "swish word".
            # Note that swish allows wildcard (truncation) in a query
            # e.g. you can search for run* to get run, runs, runner, running...
            if ( substr( $match_word, -1 ) eq '*' ) {
                next PHRASE
                    if index( $check_word, substr( $match_word, 0, length( $match_word ) - 1 ) ) != 0;
            } else {
                next PHRASE if $check_word ne $match_word;
            }

            print STDERR " *** Word Matched '$check_word' *** \n" if DEBUG_HIGHLIGHT;

            $end_pos++;
        }

        print STDERR " *** PHRASE MATCHED (word:$word_pos offset:$end_pos) *** \n" if DEBUG_HIGHLIGHT;

        # We are currently at the end word, so it's easy to set highlight for the *last* word
        $end_pos--;

        if ( !$end_pos ) {    # only one word in the "phrase"
            $words[$word_pos * 2] = "$begin$on_flag$word$off_flag$end";
        } else {
            $words[ ( $word_pos + $end_pos ) * 2 ] = "$begin$word$off_flag$end";

            # Now, reload first word of match
            $words[$word_pos * 2] =~ /$extract_regexp/
                or die "2 Why didn't '$words[$word_pos]' =~ /$extract_regexp/?";

            # Strip ignorefirst and ignorelast
            ( $begin, $word, $end ) = ( $1, $2, $3 );    # probably should cache this!

            $words[$word_pos * 2] = "$begin$on_flag$word$end";
        }

        # Now, flag the words around to be shown
        my $start = ( $word_pos - $Show_Words + 1 ) * 2;
        my $stop  = ( $word_pos + $end_pos + $Show_Words - 2 ) * 2;

        if ( $start < 0 ) {
            $stop  = $stop - $start;
            $start = 0;
        }
        $stop = $#words if $stop > $#words;

        $flags[$_]++ for $start .. $stop;

        # Now, we only bother with marking the first $occurrences of matches
        # All done, and mark where to stop looking
        if ( $occurrences-- <= 0 ) {
            $last = $end;
            last WORD;
        }

        # Now reset $word_pos to word following last matched word
        $word_pos += $end_pos;    # continue will still be executed
        next WORD;
    }

} continue {
    $word_pos++;
}
Finally when all that processing is done, I use the @flags array to join the words together into a single scalar. That's the build_highlighted_text() function in the profile above.
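That join step might look something like the sketch below -- my guess at its shape, not the actual build_highlighted_text() code -- assuming @flags marks the token indexes to keep and every unflagged run collapses to one " ... ":

```perl
use strict;
use warnings;

# Hypothetical sketch: keep flagged tokens, collapse unflagged gaps.
sub join_flagged {
    my ( $words, $flags ) = @_;
    my ( @out, $in_gap );

    for my $i ( 0 .. $#$words ) {
        if ( $flags->[$i] ) {
            push @out, $words->[$i];
            $in_gap = 0;
        }
        elsif ( !$in_gap ) {
            push @out, ' ... ';    # one ellipsis per unflagged run
            $in_gap = 1;
        }
    }
    return join '', @out;
}

my @words = ( 'skip', ' ', 'keep', ' ', 'this' );
my @flags = ( 0,      0,   1,      1,   1 );
my $context = join_flagged( \@words, \@flags );
# $context eq ' ... keep this'
```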

Clearly, this kind of processing is best done in Perl. But the speed is a problem. I can rewrite some in C, but I'd rather not. Kind of a nightmare in C, I'd imagine. So, what I'm really looking for is someone that knows perl internals to show me where I'm doing something stupid that's eating time.

Sorry for such a long post. It's an interesting problem begging for a clever solution.

Thanks!

Edit by tye to change PRE to CODE, remove TABLE tags, and add READMORE

Replies are listed 'Best First'.
Re: Context search term highlighting - Perl is too slow
by perrin (Chancellor) on Dec 20, 2001 at 01:44 UTC
    First, you can optimize the regex $cur_word =~ /$extract_regexp/ by using the /o modifier. Be careful with this under mod_perl; you'll need to use eval or regex objects there instead.
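    A qr// object gets you the same speedup as /o without freezing the pattern for the life of the interpreter, which is what bites under mod_perl when the pattern can differ between requests. A sketch, with toy character classes standing in for the real pattern built from the swish config:

```perl
use strict;
use warnings;

# Compile once per request; the compiled object is then reused on every
# match, avoiding /o's first-compile-wins trap in a persistent interpreter.
my $extract_regexp = qr/^([^A-Za-z0-9]*)([A-Za-z0-9.]+)([^A-Za-z0-9]*)$/;

my ( $begin, $word, $end ) = '(perl).' =~ $extract_regexp;
# $begin eq '(', $word eq 'perl', $end eq ').'
```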

    But ultimately, my advice would be to change the Swish-e index so that it can tell you not only what document the word is in but where in the document it was found. Then you can avoid doing this expensive parsing at request time.

      Perrin writes:
      But ultimately, my advice would be to change the Swish-e index so that it can tell you not only what document the word is in but where in the document it was found. Then you can avoid doing this expensive parsing at request time.

      Perrin,

      That's really hard, though. First, swish keeps track of word position for phrase matches. But all sorts of things will bump the position counter: special chars, some html tags, and so on. Trying to match swish-e's position data with what I could parse would be hard. It's hard enough matching up the text. So if swish told me to highlight word 243, it would be very lucky if I knew what that word was.

      The other problem is that you can imagine the volume of data that might be returned for a wildcard search like s*. Tens of thousands of word positions for a few hundred results.

      But probably my solution, if it's possible, is to have swish store the source document and, with each word, its character offset. Then for each word hit return the character offsets. Argh. I can see where phrases would be tough, too.

      Right about /o in the regexp. See my comments (and I guess confusion) in my example code...

      thanks,

        That's really hard, though.

        I figured it must be, or you would have done it already. However, by parsing these words at request time you are moving something that is intentionally done ahead (cached, basically) into the request handling. It makes sense that you would pay a performance penalty for that.

        First, swish keeps track of word position for phrase matches. But, all sorts of things will bump the position counter, special chars, some html tags, and so on.

        What I had in mind was keeping a character index into the original documents, not a word index.

        Right about /o in the regexp. See my comments (and I guess confusion) in my example code...

        Sorry, I don't see it. If you need help with /o, there are some very good regex folks on here. I also liked the discussion in the Perl Cookbook about this.

Re: Context search term highlighting - Perl is too slow
by clintp (Curate) on Dec 20, 2001 at 00:31 UTC
    Offhand, I don't see anything that looks overtly slow in the code. I wish you'd have presented this as a function with clear arguments/return values so I could throw some Real Life data at it and see how it behaves.

    At times, loops are nested three deep to do your processing. For me, that's an optimization red flag saying "you're probably going about this the wrong way." Is that necessary? Is there a way to approach this without the looping? (Again, I wish we had a function and some data to play with.)

      Can the nested loops be avoided? That's a good question.

      The problem is that I have to map between the words I'm comparing and the original "un-modified" text.

      Anyway, here's an example that runs.

      http://hank.org/modules/PhraseTest.pm

      (I think PerlMonks' wrapping code is too aggressive in code blocks)

      > dprofpp -u -p PhraseTest.pm
      Total Elapsed Time = 0.506468 Seconds
               User Time = 0.508234 Seconds
      Exclusive Times
      %Time ExclSec CumulS #Calls sec/call Csec/c Name
       61.0   0.310  0.430     30   0.0103 0.0143 PhraseHighlight::highlight_text
       15.7   0.080  0.080     30   0.0027 0.0027 PhraseHighlight::build_highlighted_text
       5.90   0.030  0.030     30   0.0010 0.0010 PhraseHighlight::substitute_highlight
       3.94   0.020  0.020     15   0.0013 0.0013 Text::Wrap::wrap
       1.97   0.010  0.010     30   0.0003 0.0003 PhraseHighlight::join_words
       1.97   0.010  0.010      1   0.0100 0.0100 warnings::BEGIN
       1.97   0.010  0.020      1   0.0100 0.0200 vars::BEGIN
       1.97   0.010  0.010      2   0.0050 0.0050 PhraseHighlight::BEGIN
       0.00   0.000 -0.000      2   0.0000      - vars::import
       0.00   0.000 -0.000      3   0.0000      - Text::Tabs::BEGIN
       0.00   0.000 -0.000      2   0.0000      - result::BEGIN
       0.00   0.000 -0.000      4   0.0000      - constant::BEGIN
       0.00   0.000 -0.000      1   0.0000      - warnings::register::import
       0.00   0.000 -0.000      2   0.0000      - warnings::register::mkMask
       0.00   0.000 -0.000      1   0.0000      - strict::unimport
      Thanks,
Re: Context search term highlighting - Perl is too slow
by hossman (Prior) on Dec 20, 2001 at 02:28 UTC
    Context search term highlighting is not a trivial problem solved by simple regular expression matching, especially when trying to highlight phrases.

    Here's why: swish-e creates a reverse index. It parses documents, extracts out "words" and creates an index. On a search, swish finds the words or phrases you are searching for, and lists the documents where they are found.

    I won't pretend that I understood everything in your explanation about how SWISH-E builds its index, or how exactly you currently do highlighting, but I do have a general suggestion for simplifying your highlighting process:

    As long as all you are doing is "prettying" up the output, then I don't think it's all that important that your highlighting logic EXACTLY match your indexing logic. Looping over each search $word and doing a s/$word/$beg_hl$word$end_hl/g may result in words being highlighted that didn't actually contribute to the actual relevance of the document -- but that isn't necessarily a bad thing.
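    In that spirit, the cheap version is just a substitution loop. This is approximate by design; \Q protects any regex metacharacters in the query words:

```perl
use strict;
use warnings;

sub naive_highlight {
    my ( $text, $beg_hl, $end_hl, @query_words ) = @_;
    for my $word ( @query_words ) {
        # case-insensitive, whole-word, every occurrence
        $text =~ s/\b(\Q$word\E)\b/$beg_hl$1$end_hl/gi;
    }
    return $text;
}

my $out = naive_highlight( 'Perl is fun; perl rocks', '<b>', '</b>', 'perl' );
# $out eq '<b>Perl</b> is fun; <b>perl</b> rocks'
```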

    If I were in your shoes, I would focus on the fact that you are doing two very different things:

    1. Index documents using a specific criteria for what words are relevant to that document.
    2. Provide brief info for each document in a search list that lets people determine by eye if a document is useful to them.
    In the case of that second one, you don't have to be exact -- just helpful.
      You are right. In fact the CGI/mod_perl script that comes with swish-e contains a few "highlighting" modules, so you can select speed vs. accuracy.

      Why I'm spending time on this is that it's actually kind of ugly when the highlight doesn't match up -- especially when searching phrases as words get highlighted that really are not related to the search.

      Just trying to have my cake and eat it too.

Re: Context search term highlighting - Perl is too slow
by dws (Chancellor) on Dec 20, 2001 at 05:08 UTC
    One common optimization technique is to apply an inexpensive "qualification" test to determine if a more expensive test is necessary. I'm wondering if that might be possible here.

    Rather than going through the overhead of splitting a chunk of source text into tokens, check first to see if any search terms are present by using a regexp. The idea is to construct the regexp string  "\b(?:word1|word2|...|wordn)\b" from the unique words in your search phrases or their stemmed forms. (You would need to expand '*' into something appropriate.) This regexp is then applied to each chunk of input text. If no match, the chunk of text gets appended directly to the output stream. If you get a match, only then do you go through the overhead of breaking up the text into tokens and applying your matching algorithm.
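    That qualification test might look something like this sketch, where expanding a trailing '*' into \w* is one guess at "something appropriate":

```perl
use strict;
use warnings;

# Build one cheap alternation regex from the unique query words.
sub make_qualifier {
    my @words = @_;
    my $alt = join '|', map {
        my $w = $_;
        $w =~ s/\*$//
            ? quotemeta($w) . '\w*'    # expand a trailing wildcard
            : quotemeta($w);
    } @words;
    return qr/\b(?:$alt)\b/i;
}

my $qualifier = make_qualifier( 'run*', 'fast' );

for my $chunk ( 'nothing to see', 'the runner was fast' ) {
    if ( $chunk =~ $qualifier ) {
        # only now pay for tokenizing and the full matching algorithm
    }
    else {
        # append the chunk to the output stream untouched
    }
}
# 'nothing to see' skips the expensive path; 'the runner was fast' takes it
```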