comment on

Ok, this is somewhat of a repost. And it is not a candidite for an easy answer. Maybe there's not answer, other than try to optimize in C (or Inline the code). But perhaps someone with good perl internal skills can help.

Frankly, I'm sure it's too much to ask from PerlMonks, too, but you never know. If this does look interesting to someone, I'll be happy to put together a some code that can be easily run as-is.

I'm a developer for the Swish-e search engine (http://swish-e.org). Swish is open source, and is always looking for help with development.

Swish includes a CGI/mod_perl script for displaying search results. The script's goal is to display the search terms highlighted just like google (well, better than google).

Here's a current running example:

http://swish-e.org/Discussion/search/swish.cgi

Context search term highlighing not a trivial problem solved by simple regular expression matching, especially when trying to highlight phrases.

Here's why: swish-e creates a reverse index. It parses documents, extracts out "words" and creates an index. On a search, swish finds the words or phrases you are searching for, and lists the doucments where they are found.

The reason "words" is in quotes there is because swish has its own idea about what makes up a word.

So to highlight terms in the source document (that were probably "hit" words as swish found them) the search script must actually (re)parse the document into "words" as swish indexes them, and be able to map those "words" back to the original text.

It's very processing intesive. Look at these numbers from profiling. Swish is returning 15 "hits" but a hit could be in the title or the body, so it's calling the highlighting code twice for each hit (hence 30 calls).

%Time ExclSec CumulS #Calls sec/call Csec/c  Name
 66.7   1.270  1.659     30   0.0423 0.0553  PhraseHighlight::highligh
+t_text
 11.5   0.220  0.220     30   0.0073 0.0073  PhraseHighlight::split_by
+_wordchars
 8.93   0.170  0.170     30   0.0057 0.0057  PhraseHighlight::build_hi
+ghlighted_text
 2.63   0.050  1.719      1   0.0499 1.7188  SwishQuery::run_swish
 2.10   0.040  0.040     11   0.0036 0.0036  CGI::_compile
 2.10   0.040  0.090      3   0.0133 0.0299  SwishSearch::BEGIN
 0.53   0.010  0.010      1   0.0100 0.0100  Exporter::export
 0.53   0.010  0.010      1   0.0100 0.0100  Config::BEGIN
 0.53   0.010  0.010     95   0.0001 0.0001  SwishQuery::config
 0.53   0.010  0.010      1   0.0100 0.0100  vars::BEGIN
 0.53   0.010  0.010      1   0.0100 0.0100  Fh::BEGIN
 0.53   0.010  0.030      3   0.0033 0.0100  CGI::BEGIN
 0.53   0.010  1.777      1   0.0100 1.7770  SwishSearch::process_requ
+est
 0.53   0.010  1.728      1   0.0099 1.7285  SwishQuery::run_query
[download]

The actual query take no time at all. It's the highlighting that kills the speed. highlight_text() is basically the code shown below.

split_by_wordchars() is just a call to split to split up the input text by "wordchars" (more below). I doubt that can be optimized -- even in C. The only way would be to avoid splitting up all the source text and instead switch to a stream parser. But that's only 10% of the processing speed.

build_highlighted_text() is the process of joining the highlighted words together (with ...'s added). This is done by looking at what's flagged in @flags array in code below.

Now, back to the overview of swish. Swish splits the source text by "wordcharacters" which is the list of allowable chars in a word. Then "IgnoreFirstChars" and "IgnoreLastChars" are removed from each word. This allows a dot inside a word, but not at the end of a word, for example. Then words are checked to see if they are valid (too long, too many digits, and so on). Then they may be converted into another word for fuzzy searching (like soundex/metaphone, or stemmed). Then that resulting word is indexed.

When a search is done, the same basic processing is done on the query passed in. Swish makes this "Parsed" query available to the calling program (i.e. the perl search front-end), so you can see how the words were converted into the words that swish is using to search its index.

So, when you do a query to swish, you can get back not only the list of document it found, but also the "Parsed" query which is the original query converted into "swish words".

So, to highlight words in the source text (let's ignore the need to parse HTML and assume we are just working with plain text):

1) get the "Parsed" query from swish. Break it into "phrases" from longest phrase to shortest phrase (and a phrase can be one word).

2) split the source text into swish "words" (as swish does during indexing) and non-swish words.

3) Start at the first word in the document.

4) Start with the longest phrase, and look for a match. Note that you must skip any stopwords. Repeat for all "phrases" until a match is found. Bump the position in the source document and repeat until X number of matches have been found.

Ok, here's the code I'm using, and this is where I'm asking for help optimizing. Do note that the "split_by_wordcharacters" function (see profile output above) does take quite a bit of time. I don't expect to get much optimization out of that because it's probably already optimized. But it might help to use a stream split method instead of splitting the entire document.

This bit of code has the text split up like:

    @words = split /$wc_regexp/, $$text_ref;
[download]

which splits into an array of "swish words" and non-swish-words. that's why everything is *2.

One important thing. This is highlighting code *in context*, so, like google, I find a word or phrase to highlight, and display a few words on both sides. The @flags array is used to flag what words need to be displayed in the context output.

WORD:
while ( $word_pos * 2 &lt; @words ) {

    PHRASE:
    foreach my $phrase ( @phrases ) {

        # is phrase is longer than what's left?
        next PHRASE if ($word_pos + @$phrase -1) * 2 &gt; @words;

        my $end_pos = 0;  # end offset of the current phrase

        # now compare all the words in the phrase

        my ( $begin, $word, $end );
        
        for my $match_word ( @$phrase ) {

            my $cur_word = $words&#091; ($word_pos + $end_pos) * 2 &#0
+93;;

            # split word into ( begin_chars, swish_word, end_chars )

            unless ( $cur_word =~ /$extract_regexp/ ) {
                warn  "Failed to parse IgnoreFirst/Last from word '";
                next PHRASE;
            }

            ( $begin, $word, $end ) = ( $1, $2, $3 );

            # swish works only with lowercase words
            my $check_word = lc $word;

            if ( $end_pos && exists $self-&gt;{stopwords}{$check_word}
+ ) {
                $end_pos++;
                print STDERR " Found stopword '$check_word' in the mid
+dle of phrase - * MATCH *\n";

                # go on to check this match word with the next word.
                redo if  ( $word_pos + $end_pos ) * 2 &lt; @words;  

                # No more words to match with, so go on to next pharse
+.
                next PHRASE;
            }

            # We may be using a fuzzy search such as stemming or sound
+ex
            if ( $stemmer_function ) {
                my $w = $stemmer_function-&gt;($check_word);
                $check_word = $w if $w;
            }

            # Now we are ready to compare the "swish word".
            # Note that swish allows wildcard (truncation) in a query
            # e.g. you can search for run* to get run, runs, runner, r
+unning...

            if ( substr( $match_word, -1 ) eq '*' ) {
                next PHRASE if index( $check_word, substr($match_word,
+ 0, length( $match_word ) - 1) ) != 0;

            } else {
                next PHRASE if $check_word ne $match_word;
            }


            print STDERR "      *** Word Matched '$check_word' *** \n"
+ if DEBUG_HIGHLIGHT;
            $end_pos++;  
        }

        print STDERR "      *** PHRASE MATCHED (word:$word_pos offset:
+$end_pos) *** \n" if DEBUG_HIGHLIGHT;


        # We are currently at the end word, so it's easy to set highli
+ght for the *last* word

        $end_pos--;

        if ( !$end_pos ) { # only one word in the "phrase"
        
            $words&#091;$word_pos * 2&#093; = "$begin$on_flag$word$off
+_flag$end";

        } else {
            $words&#091;($word_pos + $end_pos) * 2 &#093; = "$begin$wo
+rd$off_flag$end";

            #Now, reload first word of match
            $words&#091;$word_pos * 2&#093; =~ /$extract_regexp/ or di
+e "2 Why didn't '$words&#091;$word_pos&#093;' =~ /$extract_regexp/?";
            # Strip ignorefirst and ignorelast
            ( $begin, $word, $end ) = ( $1, $2, $3 );  # probably shou
+ld cache this!
            $words&#091;$word_pos * 2&#093; = "$begin$on_flag$word$end
+";
        }


        # Now, flag the words around to be shown
        my $start = ($word_pos - $Show_Words + 1) * 2;
        my $stop   = ($word_pos + $end_pos + $Show_Words - 2) * 2;
        if ( $start &lt; 0 ) {
            $stop = $stop - $start;
            $start = 0;
        }
        
        $stop = $#words if $stop &gt; $#words;

        $flags&#091;$_&#093;++ for $start .. $stop;


        # Now, we only both with marking the first $occurences of matc
+hes
        # All done, and mark where to stop looking
        if ( $occurrences-- &lt;= 0 ) {
            $last = $end;
            last WORD;
        }


        # Now reset $word_pos to word following last matches word
        $word_pos += $end_pos; # continue will still be executed
        next WORD;
    }
} continue {
    $word_pos ++;
}
[download]

Finally when all that processing is done, I use the @flags array to join the words together into a single scalar. That's the build_highlighted_text() function in the profile above.

Clearly, this kind of processing is best done in Perl. But the speed is a problem. I can rewrite some in C, but I'd rather not. Kind of a nightmare in C, I'd imagine. So, what I'm really looking for is someone that knows perl internals to show me where I'm doing something stupid that's eating time.

Sorry for such a long post. It's an interesting problem begging for a clever solution.

Thanks!

Edit by tye to change PRE to CODE, remove TABLE tags, and add READMORE

In reply to Context search term highlighting - Perl is too slow by moseley

Posts are HTML formatted. Put <p> </p> tags around your paragraphs. Put <code> </code> tags around your code and data!

Titles consisting of a single word are discouraged, and in most cases are disallowed outright.

Read Where should I post X? if you're not absolutely sure you're posting in the right place.

Please read these before you post! —

Posts may use any of the Perl Monks Approved HTML tags:

a, abbr, b, big, blockquote, br, caption, center, col, colgroup, dd, del, details, div, dl, dt, em, font, h1, h2, h3, h4, h5, h6, hr, i, ins, li, ol, p, pre, readmore, small, span, spoiler, strike, strong, sub, summary, sup, table, tbody, td, tfoot, th, thead, tr, tt, u, ul, wbr

You may need to use entities for some characters, as follows. (Exception: Within code tags, you can put the characters literally.)

	For:		Use:
	&		`&`
	<		`<`
	>		`>`
	[		`[`
	]		`]`

Link using PerlMonks shortcuts! What shortcuts can I use for linking?

See Writeup Formatting Tips and other pages linked from there for more info.