Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Hi, I have a couple of questions about speeding up a regex I've written. Please bear in mind I'm not a Perl guru, so I'd like to learn a bit from the masters on how best to write more efficient code.

Here's a snippet of my code

my @patterns = (
    qr/\bcreate\b/,
    qr/\bdrop\b/,
    qr/\bdelete\b/,
    qr/\bupdate\b/,
    qr/\binsert\b/,
);

# open a database connection here (using DBD::Sybase) and
# build a statement string to execute
$sth = $dbh->prepare("@sqlstatement");
$sth->execute;

while ($data = $sth->fetchrow_arrayref()) {
    next if ($data->[10] =~ /tempdb/i);
    for ($loop_index = 0; $loop_index < $#patterns; $loop_index++) {
        if ($data->[13] =~ /$patterns[$loop_index]/i) {
            print "$data->[3] + $data->[9] $data->[10] $data->[13]\n";
        }
    }
}
This extracts an audit trail from a sybase database and examines each row via the regex above. It produces the output I want but I can't help feeling there's a more efficient way of doing the regex bit.

The first question is: if I wanted to get the keywords from a file (so I can build up a dictionary of keywords to search for) instead of hardcoding them as compiled regular expressions as in the code above, how would I do this? For example, assume my keyword input file would look like this:

create
delete
insert
update
drop

Is there a more efficient way of doing the regex? I read somewhere about the possibility of using the study function to improve the performance.

Any help appreciated, thanks in advance

Re: speeding up a regex
by Aristotle (Chancellor) on Jan 03, 2006 at 15:09 UTC

    how would I do this ? e.g. assume my keyword input file would look like this

    Something along these lines:

    my @patterns = map { chomp; qr/\b\Q$_\E\b/ } <$filehandle>;
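
    For completeness, $filehandle there is just a handle already opened on the keyword file; something like this (filename assumed):

    open my $filehandle, '<', 'keywords.txt' or die "Can't open keywords.txt: $!";
    my @patterns = map { chomp; qr/\b\Q$_\E\b/ } <$filehandle>;
    close $filehandle;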

    Is there a more efficient way of doing the regex?

    There are various trade-offs; as long as you have few and simple patterns, combining them is more likely to slow things down than speed them up, because a combined pattern will cause the engine to waste a lot of effort backtracking to check alternatives. study has never been of any use to me, either. If you have a lot of patterns, you may want to look into Regexp::Assemble as a way of combining them.
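
    In case you end up there, a minimal Regexp::Assemble sketch (keyword list hard-coded here; in your case it would come from the file) looks something like:

    use Regexp::Assemble;

    my @keywords = qw( create drop delete update insert );

    my $ra = Regexp::Assemble->new;
    $ra->add( quotemeta $_ ) for @keywords;

    my $assembled = $ra->as_string;           # merged alternation as a plain string
    my $regexp    = qr/\b(?:$assembled)\b/i;  # add word boundaries and /i ourselves

    print "hit\n" if 'DROP TABLE foo' =~ $regexp;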

    Makeshifts last the longest.

      out of interest, how does using a variable in a regular expression affect Perl? I assume Perl recompiles the pattern each time through the loop because the variable may have changed?
      So by shifting the keyword patterns out to a file, am I therefore introducing more of a performance degradation?
      Obviously I can test this out, but it would be good to hear others' views on this.
      Many thanks

        In modern Perls -- I'm not sure which versions qualify here, maybe 5.6+ -- Perl will check whether the contents of the variable have changed. If they haven't, the regexp is not recompiled.

        For example, compare

        my @words = ( 'foo', 'bar', );
        foreach (@array) {
            foreach my $word (@words) {
                if (/\Q$word/) {    # BAD!! $word always changes.
                    print;
                    last;
                }
            }
        }

        to

        my @words = ( 'foo', 'bar', );
        foreach my $word (@words) {
            foreach (@array) {
                if (/\Q$word/) {    # GOOD!! regexp only recompiled when needed.
                    print;
                    last;
                }
            }
        }

        It's not always practical to change the order of the loops. For example, when one of them reads from a file. In that case, the solution is to precompile the regexps. For example, compare

        my @words = ( 'foo', 'bar', );
        while (<FILE>) {
            foreach my $word (@words) {
                if (/\Q$word/) {    # BAD!! $word always changes.
                    print;
                    last;
                }
            }
        }

        to

        my @words = ( 'foo', 'bar', );

        # Precompile the regexps.
        my @regexps = map { qr/\Q$_/ } @words;

        while (<FILE>) {
            foreach my $regexp (@regexps) {
                if (/$regexp/) {           # GOOD!! $regexp is a compiled regexp.
                #if ($_ =~ $regexp) {      # GOOD!! Alternate syntax.
                    print;
                    last;
                }
            }
        }

        If you're trying to match constant strings rather than regexps, then I recommend Regexp::List:

        use Regexp::List ();

        my @words  = ( 'foo', 'bar', );
        my $regexp = Regexp::List->new()->list2re(@words);

        while (<FILE>) {
            print if /$regexp/;
            #print if $_ =~ $regexp;    # Alternate syntax.
        }

        By the way,
        for ($loop_index = 0; $loop_index < $#patterns; $loop_index++) {
        is much less readable and no more efficient than
        for my $loop_index (0..$#patterns) {
        You could also have used
        foreach (@patterns) {

        Finally, in your case, I'd use

        my @patterns = qw( create drop delete update insert );

        my $regexp;
        $regexp = Regexp::List->new(modifiers => 'i')->list2re(@patterns);
        $regexp = qr/\b(?:$regexp)\b/;

        ...

        while ($data = $sth->fetchrow_arrayref()) {
            # index is faster than regexps on constant strings.
            next if index(lc($data->[10]), 'tempdb') >= 0;

            if ($data->[13] =~ $regexp) {
                print "$data->[3] $data->[9] $data->[10] $data->[13]\n";
            }
        }

        Update: Bug Fix: Changed $word to $_ in map's code block.

        That depends on a whole lot of factors. In old perls, that was true; modern perls have optimisations to avoid recompiling patterns if the interpolated variables haven’t changed.

        However, note that a match such as $foo =~ /$bar/ is magical if $bar is a pattern precompiled with qr, in that the pattern compilation phase is skipped completely. (This applies when the pattern consists of nothing but the interpolated variable.)

        And finally, the (??{}) delayed-interpolation construct lets you embed precompiled patterns in another pattern without recompiling the embedded pattern.
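
        A tiny illustration of both points (the variable name and sample string are made up):

        my $word = qr/\bdelete\b/i;    # compiled once, up front

        # The pattern is nothing but the interpolated qr object,
        # so no recompilation happens at match time.
        print "plain\n"  if 'DELETE FROM t' =~ /$word/;

        # (??{ ... }) interpolates the compiled pattern at run time,
        # again without recompiling $word itself.
        print "nested\n" if 'DELETE FROM t' =~ /^\s*(??{ $word })/;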

        Makeshifts last the longest.

Re: speeding up a regex
by ides (Deacon) on Jan 03, 2006 at 15:16 UTC

    To load up keywords from a file you would need to do something like this:

    my @patterns;

    open( my $fh, '<', "input-file.txt" ) or die "Cannot open input file: $!";
    while ( my $line = <$fh> ) {
        chomp( $line );
        push( @patterns, qr/\b$line\b/ );
    }
    close($fh);

    Your regexes are already fairly speedy by using the qr// operator to precompile them. I use this same method to look for roughly 72,000 keywords in thousands of 5-20k full text documents at NewsCloud and it can process a single full text article in under a second.

    One thing that will make your life easier is to use foreach loops instead of C-style for loops like the one you have here. Much less typing, and less confusing. Remember, you aren't using C anymore :)

    Frank Wiles <frank@wiles.org>
    http://www.wiles.org

Re: speeding up a regex
by kwaping (Priest) on Jan 03, 2006 at 16:02 UTC
    Instead of looping, you can compile everything into one simple regex, like so:
    my $pattern = qr/\b(create|delete|insert|update|drop)\b/;

    This gave me roughly a 3.5x performance increase over the loop method. It should be trivial to convert that to using an external file.
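
    For instance, building that same combined pattern from a keyword file (filename assumed) could look like:

    open my $fh, '<', 'keywords.txt' or die "Cannot open keywords.txt: $!";
    chomp( my @words = <$fh> );
    close $fh;

    my $alternation = join '|', map { quotemeta } @words;
    my $pattern     = qr/\b(?:$alternation)\b/i;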

      See also Regex::PreSuf for a slightly fancier option that might be even faster still if your different words share parts.

Re: speeding up a regex
by mrborisguy (Hermit) on Jan 03, 2006 at 16:29 UTC

    One thing you could do to speed things up is to let SQL do what it is good for. For example, say your $data->[10] column is called 'database'. Then get SQL to remove everything that has 'tempdb' in it with a WHERE database NOT LIKE '%tempdb%'. And if your list of keywords to search for is never actually regexes, just a list of keywords, then you can add something like:

    WHERE database NOT LIKE '%tempdb%' AND ( keyword LIKE '%create%' OR keyword LIKE '%delete%' OR keyword LIKE '%insert%' ... )
    I would probably achieve that with some code like this:
    my @patterns = qw/create delete insert update drop/;

    my $query = "@sqlstatement";    # don't know what you already have in @sqlstatement though
    $query .= qq{ WHERE database NOT LIKE '%tempdb%'};
    $query .= qq{ AND (};
    $query .= join " OR ", map { qq{keyword LIKE '%$_%'} } @patterns;
    $query .= qq{)};

    $sth = $dbh->prepare( $query );
    $sth->execute;

    while ($data = $sth->fetchrow_arrayref()) {
        # the query did all of our checking, so we don't need to check anything.
        print "$data->[3] $data->[9] $data->[10] $data->[13]\n";
    }

        -Bryan

Re: speeding up a regex
by jhourcle (Prior) on Jan 03, 2006 at 15:40 UTC

    I'd recommend two minor changes --

    1. Put the case-insensitive flag in the pattern when you compile it, rather than applying it on each iteration of the loop. (I haven't benchmarked it, but I like to do as much of the repetitive work before the loop as possible.)
    2. Use a single regex for matching, as you don't seem to be doing anything different based on which term within the loop matched.*

    * I know that's not strictly true: in your case, you could match each item in the list once. But I'm making a general assumption that your items look to be SQL statements, and if they're single statements they're most likely mutually exclusive, so long as you don't have subqueries.

    I'd probably rewrite it something like:
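
    A minimal sketch of that (keyword list hard-coded, same statement handle and row layout assumed):

    my $keywords = qr/\b(?:create|drop|delete|update|insert)\b/i;   # /i applied once, at compile time

    while ( my $data = $sth->fetchrow_arrayref() ) {
        next if $data->[10] =~ /tempdb/i;
        if ( $data->[13] =~ $keywords ) {
            print "$data->[3] $data->[9] $data->[10] $data->[13]\n";
        }
    }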

    You can also use perl's foreach loops to deal with iterating through a list, when the actual index isn't important. (yes, I know, it can be called with 'for', but I always think of C's for loops when I use 'for')

Re: speeding up a regex
by sgifford (Prior) on Jan 03, 2006 at 19:48 UTC
    It's fairly easy to use Benchmark to try out different solutions and see which is fastest. Here's a sample of testing multiple regexps (your original solution), one big regexp, and using index:
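
    A sketch along those lines (the sample row and keyword list are made up; plug in your own data):

    use Benchmark qw(timethese);

    my @keywords = qw( create drop delete update insert );
    my @patterns = map { qr/\b$_\b/i } @keywords;
    my $big_re   = qr/\b(?:create|drop|delete|update|insert)\b/i;
    my $row      = 'INSERT INTO audit_trail VALUES (42)';

    my %tests = (
        'Several Regexp' => sub {
            for my $re (@patterns) { return 1 if $row =~ $re }
            return 0;
        },
        'One Big Regexp' => sub { return $row =~ $big_re ? 1 : 0 },
        'With index()'   => sub {
            my $lc = lc $row;
            for my $word (@keywords) { return 1 if index($lc, $word) >= 0 }
            return 0;
        },
    );

    timethese( 100_000, \%tests );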

    In this benchmark, the one big regexp solution is fastest:

    Benchmark: timing 100000 iterations...
    One Big Regexp:  7 wallclock secs ( 6.22 CPU) @ 16077.17/s (n=100000)
    Several Regexp: 12 wallclock secs (11.27 CPU) @  8873.11/s (n=100000)
    With index():   11 wallclock secs ( 8.71 CPU) @ 11481.06/s (n=100000)
    But the results you get running on your own data will be more useful.

      I like the results from cmpthese much better than those from timethese. Here is the same benchmark using cmpthese:
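
      (Assuming the test subs are gathered in a hash such as %tests, as in the sketch above, only the call changes:)

      use Benchmark qw(cmpthese);

      cmpthese( 100_000, \%tests );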

      Prints:

      bigre: (?-xism:\b(?:create|drop|delete|update|insert)\b)
      several_re: 5
      one_re: 5
      use_index: 5
                          Rate Several Regexp   With index() One Big Regexp
      Several Regexp   24634/s             --           -24%           -50%
      With index()     32367/s            31%             --           -34%
      One Big Regexp   49358/s           100%            52%             --

      DWIM is Perl's answer to Gödel
Re: speeding up a regex
by Perl Mouse (Chaplain) on Jan 03, 2006 at 15:45 UTC
    You have two nested loops, an outer loop extracting data from the database, and an inner loop iterating over the regexes. This means that each time perl encounters a regex, the regex is different from the previous one - which will cause the regex to be compiled again. You might want to try to loop over the regexes in the outer loop, and over the data in the inner loop.

    There's no guarantee that it will be faster - your regexes are already compiled, and it might be that re-visiting the database costs too much. It may even be slower.

    But it's worth a shot.

    Perl --((8:>*
Re: speeding up a regex
by Anonymous Monk on Jan 04, 2006 at 10:42 UTC
    Thanks for all the responses on this. Fantastic!