moseley has asked for the wisdom of the Perl Monks concerning the following question:
I'm using the swish-e search engine and I'm using regular expressions to highlight search terms. I must spilt the source document into individual words and then match each word against the word(s) in the query. This can be a slow process.
wordchars - defines what characters can be in a word
ignorefirst - what characters to ignore at the start of a word
ignorelast - and what to ignore at the end of a word
For example,
wordchars = [a-z#], ignorefirst = [#]
Then if a document contained the string " foo#bar " swish would first split the string into "foo#" and "#bar" based on wordchars, then the ignorefirst would strip the "#" and swish would index two words "foo#" and "bar".
So you can see that highlighting the terms is more than just using:
Now, swish is nice when searching and does half the work. If you enter a query swish will convert the query into "swish words" and dispaly them. So, if you query for "foo#bar" swish will returns[\b($word)\b][<b>$1</b>]g;
Parsed words: foo# bar".
Onto the code.
The design is
A) create three compiled regular expressions for 1) splitting the source text into "swish words" and non-swish words, 2) an expression to extract the word to match out of this new "swish word" by removing ignorefirst and ignorelast characters, and 3) an expression to match this word against the words in the original query,
B) extract out the words from the source text,
C) compare against the regexp built from the "parsed words" query, and if a match then mark words on both sides of the matched word for display,
D) join together all marked words for output.
Note that there's also wildcard matching in swish so the query is "foo#bar*" ends up as "foo#" and "bar*" where any word that starts with the three letters bar will match. Words may also be stemmed with a word stemmer (as seen below).
Again, this code works, ok, but I just would like to make it run faster. It runs under mod_perl so it can't /o all the regular expressions. I hope this isn't too unclear:
Tear away!Highlighting works by this call: $text = highlight( \$text, $stemmer_function, set_match_regexp( \%heads ) +); First, this builds the three regular expressions: ------------------------------------------------- sub set_match_regexp { my $header = shift; my ($query, $wc, $ignoref, $ignorel ) = @{$header}{'parsed words', qw/wordcharacters ignorefirstchar ignorelastchar/}; $wc = quotemeta $wc; # build regexp to match against the query words # If a trailing "*" is found then do a wild card search. my $match_string = join '|', map { substr( $_, -1, 1 ) eq '*' ? quotemeta( substr( $_, 0, -1) ) . "[$wc]*?" : quotemeta } # get rid of operator words and characters grep { ! /^(and|or|not|["()=])$/oi } split /\s+/, $query; # These are to remove and capture the ignorefirst # and ingnorelast chars for ( $ignoref, $ignorel ) { if ( $_ ) { $_ = quotemeta; $_ = "([$_]*)"; } else { $_ = '()'; } } # Avoid /i matching... $wc .= 'ABCDEFGHIJKLMNOPQRSTUVWXYZ'; # Warning: dependent on tolower +used while indexing # return the three compiled expressions # 1) used to split source into swish/non-swish words # 2) used to strip ignorefirst/ignorelast # 3) used to match final word against query word(s) return ( qr/([^$wc]+)/o, qr/^$ignoref([$wc]+?)$ignorel$/io, qr/^$match_string$/, ); And here's the time killer: --------------------------- sub highlight { my ( $text_ref, $stemmer_function, $wc_regexp, $extract_regexp, $match_regexp ) = @_; my $last = 0; # Split source into swish-words and non-swish words. # Every other element is a swish-word due to the capture my @words = split /$wc_regexp/, $$text_ref; # This keeps track of what words to print (context printing) my @flags; $flags[$#words] = 0; # Extend array. my $occurrences = $Occurrences; # limit number of matches my $pos = 0; while ( $pos <= $#words ) { # pull out the (begin_chars, match word, end_chars) if ( $words[$pos] =~ /$extract_regexp/ ) { my ( $begin, $word, $end ) = ( $1, $2, $3 ); # Stem word if enabled, else just use lower case my $test = $stemmer_function ? $stemmer_function->($word) : lc $word; $test ||= lc $word; # Not check if word matches if ( $test =~ /$match_regexp/ ) { $words[$pos] = "$begin<b>$word</b>$end"; # turn on display on both sides of word my $start = $pos - $Show_Words + 1; my $end = $pos + $Show_Words - 1; if ( $start < 0 ) { $end = $end - $start; $start = 0; } $end = $#words if $end > $#words; $flags[$_]++ for $start .. $end; # All done, and mark where to stop looking if ( $occurrences-- <= 0 ) { $last = $end; last; } } } $pos += 2; # Skip to next wordchar word } # Now build array of just words to display my @output; my $printing; my $first = 1; my $some_printed; for my $i ( 0 ..$#words ) { if ( $last && $i >= $last && $i < $#words ) { push @output, '...'; last; } # Is word marked to display? if ( $flags[$i] ) { push @output, '...' if !$printing++ && !$first; push @output, $words[$i]; $some_printed++; } else { $printing = 0; } $first = 0; } # if none highlighted show $Min_Words if ( !$some_printed ) { for my $i ( 0 .. $Min_Words ) { last if $i >= $#words; push @output, $words[$i]; } } push @output,'...' if !$printing; return join '', @output; }
|
|---|
| Replies are listed 'Best First'. | |
|---|---|
|
Re: Optimizing regular expressions
by japhy (Canon) on Jun 02, 2001 at 01:11 UTC | |
by moseley (Acolyte) on Jun 02, 2001 at 02:14 UTC | |
by japhy (Canon) on Jun 02, 2001 at 18:41 UTC | |
by moseley (Acolyte) on Jun 02, 2001 at 23:09 UTC | |
by japhy (Canon) on Jun 03, 2001 at 04:07 UTC | |
| |
|
(boo)Re: Optimizing regular expressions
by boo_radley (Parson) on Jun 02, 2001 at 01:00 UTC |