comment on

This will be a long one. This is a specific question, but I'm really more interested in learning how to optimize a bit of perl code. I would benchmark, but with this bit of code I'm not sure what to change. So I'm here with you now....

I'm using the swish-e search engine and I'm using regular expressions to highlight search terms. I must spilt the source document into individual words and then match each word against the word(s) in the query. This can be a slow process.

Overview: Swish defines "words" to be indexed by three settings:

wordchars - defines what characters can be in a word
ignorefirst - what characters to ignore at the start of a word
ignorelast - and what to ignore at the end of a word

For example,
wordchars = [a-z#], ignorefirst = [#]

Then if a document contained the string " foo#&#bar " swish would first split the string into "foo#" and "#bar" based on wordchars, then the ignorefirst would strip the "#" and swish would index two words "foo#" and "bar".

So you can see that highlighting the terms is more than just using:

    s[\b($word)\b][<b>$1</b>]g;
[download]

Now, swish is nice when searching and does half the work. If you enter a query swish will convert the query into "swish words" and dispaly them. So, if you query for "foo#&#bar" swish will return

Parsed words: foo# bar".

Onto the code.

The design is

A) create three compiled regular expressions for 1) splitting the source text into "swish words" and non-swish words, 2) an expression to extract the word to match out of this new "swish word" by removing ignorefirst and ignorelast characters, and 3) an expression to match this word against the words in the original query,

B) extract out the words from the source text,

C) compare against the regexp built from the "parsed words" query, and if a match then mark words on both sides of the matched word for display,

D) join together all marked words for output.

Note that there's also wildcard matching in swish so the query is "foo#&#bar*" ends up as "foo#" and "bar*" where any word that starts with the three letters bar will match. Words may also be stemmed with a word stemmer (as seen below).

Again, this code works, ok, but I just would like to make it run faster. It runs under mod_perl so it can't /o all the regular expressions. I hope this isn't too unclear:

Highlighting works by this call:

 $text = 
    highlight( \$text, $stemmer_function, set_match_regexp( \%heads ) 
+);


First, this builds the three regular expressions:
-------------------------------------------------

sub set_match_regexp {
my $header = shift;

my ($query, $wc, $ignoref, $ignorel ) =
    @{$header}{'parsed words',
         qw/wordcharacters ignorefirstchar ignorelastchar/};

$wc = quotemeta $wc;

# build regexp to match against the query words
# If a trailing "*" is found then do a wild card search.

my $match_string =
    join '|',
       map { substr( $_, -1, 1 ) eq '*'
                ? quotemeta( substr( $_, 0, -1) ) . "[$wc]*?"
                : quotemeta
           }
            # get rid of operator words and characters
            grep { ! /^(and|or|not|["()=])$/oi }
                split /\s+/, $query;


# These are to remove and capture the ignorefirst 
# and ingnorelast chars

for ( $ignoref, $ignorel ) {
    if ( $_ ) {
        $_ = quotemeta;
        $_ = "([$_]*)";
    } else {
        $_ = '()';
    }
}

# Avoid /i matching...
$wc .= 'ABCDEFGHIJKLMNOPQRSTUVWXYZ';  # Warning: dependent on tolower 
+used while indexing

# return the three compiled expressions
# 1) used to split source into swish/non-swish words
# 2) used to strip ignorefirst/ignorelast
# 3) used to match final word against query word(s)

return (
    qr/([^$wc]+)/o,
    qr/^$ignoref([$wc]+?)$ignorel$/io,
    qr/^$match_string$/,
);

And here's the time killer:
---------------------------

sub highlight {
my ( $text_ref, $stemmer_function,
     $wc_regexp, $extract_regexp, $match_regexp ) = @_;


my $last = 0;

# Split source into swish-words and non-swish words.
# Every other element is a swish-word due to the capture
my @words = split /$wc_regexp/, $$text_ref;


# This keeps track of what words to print (context printing)
my @flags;
$flags[$#words] = 0;  # Extend array.

my $occurrences = $Occurrences;  # limit number of matches

my $pos = 0;
while ( $pos <= $#words ) {

    # pull out the (begin_chars, match word, end_chars)
    if ( $words[$pos] =~ /$extract_regexp/ ) {

        my ( $begin, $word, $end ) = ( $1, $2, $3 );

        # Stem word if enabled, else just use lower case
        my $test = $stemmer_function
                   ? $stemmer_function->($word)
                   : lc $word;

        $test ||= lc $word;                       

        # Not check if word matches
        if ( $test =~ /$match_regexp/ ) {

            $words[$pos] = "$begin<b>$word</b>$end";


            # turn on display on both sides of word
            my $start = $pos - $Show_Words + 1;
            my $end   = $pos + $Show_Words - 1;
            if ( $start < 0 ) {
                $end = $end - $start;
                $start = 0;
            }
            
            $end = $#words if $end > $#words;

            $flags[$_]++ for $start .. $end;


            # All done, and mark where to stop looking
            if ( $occurrences-- <= 0 ) {
                $last = $end;
                last;
            }
        }
    }

   $pos += 2;  # Skip to next wordchar word
}


# Now build array of just words to display
my @output;

my $printing;
my $first = 1;
my $some_printed;

for my $i ( 0 ..$#words ) {

    if ( $last && $i >= $last && $i < $#words ) {
        push @output, '...';
        last;
    }

    # Is word marked to display?
    if ( $flags[$i] ) {

        push @output, '...' if !$printing++ && !$first;
        push @output, $words[$i];
        $some_printed++;

    } else {
        $printing = 0;
    }

    $first = 0;
}

# if none highlighted show $Min_Words
if ( !$some_printed ) {
    for my $i ( 0 .. $Min_Words ) {
        last if $i >= $#words;
        push @output, $words[$i];
    }
}
    
push @output,'...' if !$printing;

return join '', @output;
}
[download]

Tear away!

In reply to Optimizing regular expressions by moseley

Posts are HTML formatted. Put <p> </p> tags around your paragraphs. Put <code> </code> tags around your code and data!

Titles consisting of a single word are discouraged, and in most cases are disallowed outright.

Read Where should I post X? if you're not absolutely sure you're posting in the right place.

Please read these before you post! —

Posts may use any of the Perl Monks Approved HTML tags:

a, abbr, b, big, blockquote, br, caption, center, col, colgroup, dd, del, details, div, dl, dt, em, font, h1, h2, h3, h4, h5, h6, hr, i, ins, li, ol, p, pre, readmore, small, span, spoiler, strike, strong, sub, summary, sup, table, tbody, td, tfoot, th, thead, tr, tt, u, ul, wbr

You may need to use entities for some characters, as follows. (Exception: Within code tags, you can put the characters literally.)

	For:		Use:
	&		`&`
	<		`<`
	>		`>`
	[		`[`
	]		`]`

Link using PerlMonks shortcuts! What shortcuts can I use for linking?

See Writeup Formatting Tips and other pages linked from there for more info.