Optimizing regular expressions

moseley has asked for the wisdom of the Perl Monks concerning the following question:

This will be a long one. This is a specific question, but I'm really more interested in learning how to optimize a bit of perl code. I would benchmark, but with this bit of code I'm not sure what to change. So I'm here with you now....

I'm using the swish-e search engine and I'm using regular expressions to highlight search terms. I must spilt the source document into individual words and then match each word against the word(s) in the query. This can be a slow process.

Overview: Swish defines "words" to be indexed by three settings:

wordchars - defines what characters can be in a word
ignorefirst - what characters to ignore at the start of a word
ignorelast - and what to ignore at the end of a word

For example,
wordchars = [a-z#], ignorefirst = [#]

Then if a document contained the string " foo#&#bar " swish would first split the string into "foo#" and "#bar" based on wordchars, then the ignorefirst would strip the "#" and swish would index two words "foo#" and "bar".

So you can see that highlighting the terms is more than just using:

    s[\b($word)\b][<b>$1</b>]g;
[download]

Now, swish is nice when searching and does half the work. If you enter a query swish will convert the query into "swish words" and dispaly them. So, if you query for "foo#&#bar" swish will return

Parsed words: foo# bar".

Onto the code.

The design is

A) create three compiled regular expressions for 1) splitting the source text into "swish words" and non-swish words, 2) an expression to extract the word to match out of this new "swish word" by removing ignorefirst and ignorelast characters, and 3) an expression to match this word against the words in the original query,

B) extract out the words from the source text,

C) compare against the regexp built from the "parsed words" query, and if a match then mark words on both sides of the matched word for display,

D) join together all marked words for output.

Note that there's also wildcard matching in swish so the query is "foo#&#bar*" ends up as "foo#" and "bar*" where any word that starts with the three letters bar will match. Words may also be stemmed with a word stemmer (as seen below).

Again, this code works, ok, but I just would like to make it run faster. It runs under mod_perl so it can't /o all the regular expressions. I hope this isn't too unclear:

Highlighting works by this call:

 $text = 
    highlight( \$text, $stemmer_function, set_match_regexp( \%heads ) 
+);


First, this builds the three regular expressions:
-------------------------------------------------

sub set_match_regexp {
my $header = shift;

my ($query, $wc, $ignoref, $ignorel ) =
    @{$header}{'parsed words',
         qw/wordcharacters ignorefirstchar ignorelastchar/};

$wc = quotemeta $wc;

# build regexp to match against the query words
# If a trailing "*" is found then do a wild card search.

my $match_string =
    join '|',
       map { substr( $_, -1, 1 ) eq '*'
                ? quotemeta( substr( $_, 0, -1) ) . "[$wc]*?"
                : quotemeta
           }
            # get rid of operator words and characters
            grep { ! /^(and|or|not|["()=])$/oi }
                split /\s+/, $query;


# These are to remove and capture the ignorefirst 
# and ingnorelast chars

for ( $ignoref, $ignorel ) {
    if ( $_ ) {
        $_ = quotemeta;
        $_ = "([$_]*)";
    } else {
        $_ = '()';
    }
}

# Avoid /i matching...
$wc .= 'ABCDEFGHIJKLMNOPQRSTUVWXYZ';  # Warning: dependent on tolower 
+used while indexing

# return the three compiled expressions
# 1) used to split source into swish/non-swish words
# 2) used to strip ignorefirst/ignorelast
# 3) used to match final word against query word(s)

return (
    qr/([^$wc]+)/o,
    qr/^$ignoref([$wc]+?)$ignorel$/io,
    qr/^$match_string$/,
);

And here's the time killer:
---------------------------

sub highlight {
my ( $text_ref, $stemmer_function,
     $wc_regexp, $extract_regexp, $match_regexp ) = @_;


my $last = 0;

# Split source into swish-words and non-swish words.
# Every other element is a swish-word due to the capture
my @words = split /$wc_regexp/, $$text_ref;


# This keeps track of what words to print (context printing)
my @flags;
$flags[$#words] = 0;  # Extend array.

my $occurrences = $Occurrences;  # limit number of matches

my $pos = 0;
while ( $pos <= $#words ) {

    # pull out the (begin_chars, match word, end_chars)
    if ( $words[$pos] =~ /$extract_regexp/ ) {

        my ( $begin, $word, $end ) = ( $1, $2, $3 );

        # Stem word if enabled, else just use lower case
        my $test = $stemmer_function
                   ? $stemmer_function->($word)
                   : lc $word;

        $test ||= lc $word;                       

        # Not check if word matches
        if ( $test =~ /$match_regexp/ ) {

            $words[$pos] = "$begin<b>$word</b>$end";


            # turn on display on both sides of word
            my $start = $pos - $Show_Words + 1;
            my $end   = $pos + $Show_Words - 1;
            if ( $start < 0 ) {
                $end = $end - $start;
                $start = 0;
            }
            
            $end = $#words if $end > $#words;

            $flags[$_]++ for $start .. $end;


            # All done, and mark where to stop looking
            if ( $occurrences-- <= 0 ) {
                $last = $end;
                last;
            }
        }
    }

   $pos += 2;  # Skip to next wordchar word
}


# Now build array of just words to display
my @output;

my $printing;
my $first = 1;
my $some_printed;

for my $i ( 0 ..$#words ) {

    if ( $last && $i >= $last && $i < $#words ) {
        push @output, '...';
        last;
    }

    # Is word marked to display?
    if ( $flags[$i] ) {

        push @output, '...' if !$printing++ && !$first;
        push @output, $words[$i];
        $some_printed++;

    } else {
        $printing = 0;
    }

    $first = 0;
}

# if none highlighted show $Min_Words
if ( !$some_printed ) {
    for my $i ( 0 .. $Min_Words ) {
        last if $i >= $#words;
        push @output, $words[$i];
    }
}
    
push @output,'...' if !$printing;

return join '', @output;
}
[download]

Tear away!

Comment on Optimizing regular expressions Select or Download Code

Replies are listed 'Best First'.
Re: Optimizing regular expressions by japhy (Canon) on Jun 02, 2001 at 01:11 UTC
As one of the resident regexiacs, I'm going to spend a lot of time examining your program. Just so you know, if I had three strings, `$word`, `$ig_first`, and `$ig_last`, which held the characters for each of those three classes, this is how I would construct a regex to match words from a text stream: #!/usr/bin/perl -wl ### this code assumes that there are no characters ### in the "ignore_last" class that AREN'T in the ### "word" class -- it might seem silly that there ### would be, but still, that's how I'm coding this use strict; my $text_stream = q{foo#&#bar}; my $ig_first = '#'; my $ig_last = ''; my $word = 'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz#'; my $pre = length($ig_first) ? qr/[\Q$ig_first\E]/ : ''; my $post = length($ig_last) ? qr/[\Q$ig_last\E]/ : ''; my $inside = length($ig_last) ? qr/[\Q$ig_last\E]+/ : ''; my ($match, @words); { # remove chars from $word that are matched by $post my $reg = $word; $reg =~ s/$post+//g if $post; $reg = qr/[\Q$reg\E]/; # unroll the loop: $match = qr{ ($pre) # pre chars (save to $1) ( # (save to $2) $reg+ # one or more regular chars \| # OR $reg* # zero or more regular chars (?: $inside # one or more post chars $reg+ # one or more non-post chars )+ # this chunk one or more times ) }x; # /x for extended mode } $text_stream =~ s[$match]{ push @words, $2; "$1<b>$2</b>" }eg; print $text_stream; print "words: @words"; [download] </code> `japhy` -- Perl and Regex Hacker	[reply] [d/l]
Re: Re: Optimizing regular expressions by moseley (Acolyte) on Jun 02, 2001 at 02:14 UTC
I appreciate the help, and I'm sorry for such a general question. One thing to make clear is that the ignore_first and ignore_last characters are subsets of wordchars. The point of those settings is to allow characters within a word, but not at the start/end. The classic example is that a dot is ok within a host name, but not at the end (e.g. at the end of a sentence). The other thing that complicates this a bit is I'm not showing all the original text with the words highlighted in the final output, but rather just a few words on either side of the highlighted text. Like a google search shows. Thus, I also need to be careful not to print words twice when highlighted words are close together. That is the reason I split the source text into an array of "swish" words and non swish words -- so I could easily mark words on either side of the matched word. `@words = split /([^$wordchars])/, $source_text;<p>` [download] I use two arrays to track this now (instead of an array of arrays) as it was faster to avoid all the dereferencing. Here's a specific question: The "problem" with the above code is that I still need to remove the leading and trailing ignore characters. So, that's an extra pattern match for every word - one time for the split, and then another to extract out a word from its ignore chars. I tried to find an expression to use in the above spilt that would do this in one shot, but it was looking like a complicated expression that might be slower than doing two matches. But I never found a pattern that I could test. I also wonder if using a repeating pattern with /g might be faster than my word-by-word checking. But then I'm back to the problem of how to print the words around the match. Thanks again for your help.	[reply] [d/l]
Re: Re: Re: Optimizing regular expressions by japhy (Canon) on Jun 02, 2001 at 18:41 UTC
I wrote up a module you might find useful. Let me know if it, or the test program, needs some documentation. It appears to be quite nifty. Swish.pm `japhy` -- Perl and Regex Hacker	[reply]
Re: Optimizing regular expressions by moseley (Acolyte) on Jun 02, 2001 at 23:09 UTC
Swish module (was Re: Re: Optimizing regular expressions) by japhy (Canon) on Jun 03, 2001 at 04:07 UTC
Some notes below your chosen depth have not been shown here
(boo)Re: Optimizing regular expressions by boo_radley (Parson) on Jun 02, 2001 at 01:00 UTC
your request is hazy, seeker. reading the title, and with heavy use of `PSI::ESP qw(kreskin)`, I imagine your question lies specifically in making these three statements faster/better/easier : `qr/([^$wc]+)/o, qr/^$ignoref([$wc]+?)$ignorel$/io, qr/^$match_string$/,` [download] if this isn't the case, tighten up your question and ask again. Be specific and concentrate on your petition. here's a regexp free implementation, in test form. `use strict; my @wordchars = ("a".."z","#"); my @ignorefirst = ("#"); my $test=" foo#&#bar&#baz "; my $wordchars=join "",@wordchars; my $ignorefirst=join "",@ignorefirst; # # get words # eval ("\$test=~tr/$wordchars/\\n/c;"); print $test; my @words=split /\n/,$test; # # now that the words are obtained, work on removing any ignorefirst # foreach (@words) { eval ("substr (\$_,0,1)=~ tr/$ignorefirst//d"); # ignorelast would be similar. } print "search terms are : "; foreach (@words) {print "$_ ";}` [download] I suppose there's a performance hit for the 2 evals, but I don't know how they'd compare to your methods. I imagine you could change the relevant lines to something like `substr (\$_,0,1)=~ s/[$ignorefirst]//;` [download] but that's untested, but I imagine (really imagining, I've no idea) it'd be pretty optimized by the regex engine since it's but one char.	[reply] [d/l] [select]