Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Hi, I have a couple of questions about speeding up a regex I've written. Please bear in mind I'm not a Perl guru, so I'd like to learn a bit from the masters on how best to write more efficient code.

Here's a snippet of my code

my @patterns = (
    qr/\bcreate\b/,
    qr/\bdrop\b/,
    qr/\bdelete\b/,
    qr/\bupdate\b/,
    qr/\binsert\b/,
);

# open a database connection here (using DBD::Sybase) and
# build a statement string to execute
$sth = $dbh->prepare("@sqlstatement");
$sth->execute;

while ($data = $sth->fetchrow_arrayref()) {
    next if ($data->[10] =~ /tempdb/i);
    for ($loop_index = 0; $loop_index < $#patterns; $loop_index++) {
        if ($data->[13] =~ /$patterns[$loop_index]/i) {
            print "$data->[3] + $data->[9] $data->[10] $data->[13]\n";
        }
    }
}
This extracts an audit trail from a sybase database and examines each row via the regex above. It produces the output I want but I can't help feeling there's a more efficient way of doing the regex bit.

The first question is: if I wanted to get the keywords from a file (so I can build up a dictionary of keywords to search for) instead of hardcoding them as compiled regular expressions as in the code above, how would I do this? For example, assume my keyword input file would look like this:

create
delete
insert
update
drop

Is there a more efficient way of doing the regex? I read somewhere about the possibility of using the study function to improve the performance.

Any help appreciated, thanks in advance

Re: speeding up a regex
by Aristotle (Chancellor) on Jan 03, 2006 at 15:09 UTC

    how would I do this ? e.g. assume my keyword input file would look like this

    Something along these lines:

    my @patterns = map { chomp; qr/\b\Q$_\E\b/ } <$filehandle>;
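
    For completeness, $filehandle there is just a handle already opened on the keyword file; something like this (filename assumed):

    open my $filehandle, '<', 'keywords.txt' or die "Can't open keywords.txt: $!";
    my @patterns = map { chomp; qr/\b\Q$_\E\b/ } <$filehandle>;
    close $filehandle;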

    Is there a more efficient way of doing the regex?

    There are various trade-offs; as long as you have few and simple patterns, combining them is more likely to slow things down than speed them up, because a combined pattern will cause the engine to waste a lot of effort backtracking to check alternatives. study has never been of any use to me, either. If you have a lot of patterns, you may want to look into Regexp::Assemble as a way of combining them.
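
    In case you end up there, a minimal Regexp::Assemble sketch (keyword list hard-coded here; in your case it would come from the file) looks something like:

    use Regexp::Assemble;

    my @keywords = qw( create drop delete update insert );

    my $ra = Regexp::Assemble->new;
    $ra->add( quotemeta $_ ) for @keywords;

    my $assembled = $ra->as_string;           # merged alternation as a plain string
    my $regexp    = qr/\b(?:$assembled)\b/i;  # add word boundaries and /i ourselves

    print "hit\n" if 'DROP TABLE foo' =~ $regexp;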

    Makeshifts last the longest.

      out of interest, how does using a variable in a regular expression affect Perl? I assume Perl recompiles the pattern each time through the loop because the variable may have changed?
      So by shifting the keyword patterns out to a file, am I therefore introducing more of a performance degradation?
      Obviously I can test this out, but it would be good to hear others' views on this.
      Many thanks

        In modern Perls -- I'm not sure which versions qualify here, maybe 5.6+ -- Perl will check whether the contents of the variable have changed. If they haven't, the regexp is not recompiled.

        For example, compare

        my @words = ( 'foo', 'bar', );
        foreach (@array) {
            foreach my $word (@words) {
                if (/\Q$word/) {    # BAD!! $word always changes.
                    print;
                    last;
                }
            }
        }

        to

        my @words = ( 'foo', 'bar', );
        foreach my $word (@words) {
            foreach (@array) {
                if (/\Q$word/) {    # GOOD!! regexp only recompiled when needed.
                    print;
                    last;
                }
            }
        }

        It's not always practical to change the order of the loops. For example, when one of them reads from a file. In that case, the solution is to precompile the regexps. For example, compare

        my @words = ( 'foo', 'bar', );
        while (<FILE>) {
            foreach my $word (@words) {
                if (/\Q$word/) {    # BAD!! $word always changes.
                    print;
                    last;
                }
            }
        }

        to

        my @words = ( 'foo', 'bar', );

        # Precompile the regexps.
        my @regexps = map { qr/\Q$_/ } @words;

        while (<FILE>) {
            foreach my $regexp (@regexps) {
                if (/$regexp/) {           # GOOD!! $regexp is a compiled regexp.
                #if ($_ =~ $regexp) {      # GOOD!! Alternate syntax.
                    print;
                    last;
                }
            }
        }

        If you're trying to match constant strings rather than regexps, then I recommend Regexp::List:

        use Regexp::List ();

        my @words  = ( 'foo', 'bar', );
        my $regexp = Regexp::List->new()->list2re(@words);

        while (<FILE>) {
            print if /$regexp/;
            #print if $_ =~ $regexp;    # Alternate syntax.
        }

        By the way,
        for ($loop_index = 0; $loop_index < $#patterns; $loop_index++) {
        is much less readable and no more efficient than
        for my $loop_index (0..$#patterns) {
        You could also have used
        foreach (@patterns) {

        Finally, in your case, I'd use

        my @patterns = qw( create drop delete update insert );

        my $regexp;
        $regexp = Regexp::List->new(modifiers => 'i')->list2re(@patterns);
        $regexp = qr/\b(?:$regexp)\b/;

        ...

        while ($data = $sth->fetchrow_arrayref()) {
            # index is faster than regexps on constant strings.
            next if index(lc($data->[10]), 'tempdb') >= 0;

            if ($data->[13] =~ $regexp) {
                print "$data->[3] $data->[9] $data->[10] $data->[13]\n";
            }
        }

        Update: Bug Fix: Changed $word to $_ in map's code block.

        That depends on a whole lot of factors. In old perls, that was true; modern perls have optimisations to avoid recompiling patterns if the interpolated variables haven’t changed.

        However, note that a match such as $foo =~ /$bar/ is magical if $bar is a pattern precompiled with qr, in that the pattern compilation phase is skipped completely. (This applies when the pattern consists of nothing but the interpolated variable.)

        And finally, the (??{}) delayed-interpolation construct lets you embed precompiled patterns in another pattern without recompiling the embedded pattern.
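
        A tiny illustration of both points (the variable name and sample string are made up):

        my $word = qr/\bdelete\b/i;    # compiled once, up front

        # The pattern is nothing but the interpolated qr object,
        # so no recompilation happens at match time.
        print "plain\n"  if 'DELETE FROM t' =~ /$word/;

        # (??{ ... }) interpolates the compiled pattern at run time,
        # again without recompiling $word itself.
        print "nested\n" if 'DELETE FROM t' =~ /^\s*(??{ $word })/;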

        Makeshifts last the longest.

Re: speeding up a regex
by ides (Deacon) on Jan 03, 2006 at 15:16 UTC

    To load up keywords from a file you would need to do something like this:

    my @patterns;

    open( my $fh, '<', "input-file.txt" ) or die "Cannot open input file: $!";
    while ( my $line = <$fh> ) {
        chomp( $line );
        push( @patterns, qr/\b$line\b/ );
    }
    close($fh);

    Your regexes are already fairly speedy by using the qr// operator to precompile them. I use this same method to look for roughly 72,000 keywords in thousands of 5-20k full text documents at NewsCloud and it can process a single full text article in under a second.

    One thing that will make your life easier is to use foreach loops instead of C-style for loops like the one you have here. Much less typing, and less confusing. Remember, you aren't using C anymore :)

    Frank Wiles <frank@wiles.org>
    http://www.wiles.org

Re: speeding up a regex
by kwaping (Priest) on Jan 03, 2006 at 16:02 UTC
    Instead of looping, you can compile everything into one simple regex, like so:
    my $pattern = qr/\b(create|delete|insert|update|drop)\b/;

    This gave me roughly a 3.5x performance increase over the loop method. It should be trivial to convert that to using an external file.
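
    For instance, building that same combined pattern from a keyword file (filename assumed) could look like:

    open my $fh, '<', 'keywords.txt' or die "Cannot open keywords.txt: $!";
    chomp( my @words = <$fh> );
    close $fh;

    my $alternation = join '|', map { quotemeta } @words;
    my $pattern     = qr/\b(?:$alternation)\b/i;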

      See also Regex::PreSuf for a slightly fancier option that might be even faster still if your different words share parts.

Re: speeding up a regex
by mrborisguy (Hermit) on Jan 03, 2006 at 16:29 UTC

    One thing you could do to speed things up is to let SQL do what it is good for. For example, say your $data->[10] column is called 'database'. Then get SQL to remove everything that has 'tempdb' in it with a WHERE database NOT LIKE '%tempdb%'. And if your list of keywords to search for is never actually regexes, just a list of keywords, then you can add something like:

    WHERE database NOT LIKE '%tempdb%' AND ( keyword LIKE '%create%' OR keyword LIKE '%delete%' OR keyword LIKE '%insert%' ... )
    I would probably achieve that with some code like this:
    my @patterns = qw/create delete insert update drop/;

    my $query = "@sqlstatement";    # don't know what you already have in @sqlstatement though
    $query .= qq{ WHERE database NOT LIKE '%tempdb%'};
    $query .= qq{ AND (};
    $query .= join " OR ", map { qq{keyword LIKE '%$_%'} } @patterns;
    $query .= qq{)};

    $sth = $dbh->prepare( $query );
    $sth->execute;

    while ($data = $sth->fetchrow_arrayref()) {
        # the query did all of our checking, so we don't need to check anything.
        print "$data->[3] $data->[9] $data->[10] $data->[13]\n";
    }

        -Bryan

Re: speeding up a regex
by jhourcle (Prior) on Jan 03, 2006 at 15:40 UTC

    I'd recommend two minor changes --

    1. Put the case-insensitive flag in the pattern when you compile it, rather than applying it on each iteration of the loop. (I haven't benchmarked it, but I like to do as much of the repetitive work before the loop as possible.)
    2. Use a single regex for matching, as you don't seem to be doing anything different based on which term within the loop matched.*

    * I know that's not strictly true: in your case, you could match each item in the list once. But I'm making a general assumption that your items look to be SQL statements, and if they're single statements they're most likely mutually exclusive, so long as you don't have subqueries.

    I'd probably rewrite it something like:
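
    A minimal sketch of that (keyword list hard-coded, same statement handle and row layout assumed):

    my $keywords = qr/\b(?:create|drop|delete|update|insert)\b/i;   # /i applied once, at compile time

    while ( my $data = $sth->fetchrow_arrayref() ) {
        next if $data->[10] =~ /tempdb/i;
        if ( $data->[13] =~ $keywords ) {
            print "$data->[3] $data->[9] $data->[10] $data->[13]\n";
        }
    }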

    You can also use perl's foreach loops to deal with iterating through a list, when the actual index isn't important. (yes, I know, it can be called with 'for', but I always think of C's for loops when I use 'for')

Re: speeding up a regex
by sgifford (Prior) on Jan 03, 2006 at 19:48 UTC
    It's fairly easy to use Benchmark to try out different solutions and see which is fastest. Here's a sample of testing multiple regexps (your original solution), one big regexp, and using index:
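
    A sketch along those lines (the sample row and keyword list are made up; plug in your own data):

    use Benchmark qw(timethese);

    my @keywords = qw( create drop delete update insert );
    my @patterns = map { qr/\b$_\b/i } @keywords;
    my $big_re   = qr/\b(?:create|drop|delete|update|insert)\b/i;
    my $row      = 'INSERT INTO audit_trail VALUES (42)';

    my %tests = (
        'Several Regexp' => sub {
            for my $re (@patterns) { return 1 if $row =~ $re }
            return 0;
        },
        'One Big Regexp' => sub { return $row =~ $big_re ? 1 : 0 },
        'With index()'   => sub {
            my $lc = lc $row;
            for my $word (@keywords) { return 1 if index($lc, $word) >= 0 }
            return 0;
        },
    );

    timethese( 100_000, \%tests );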

    In this benchmark, the one big regexp solution is fastest:

    Benchmark: timing 100000 iterations...
    One Big Regexp:  7 wallclock secs ( 6.22 CPU) @ 16077.17/s (n=100000)
    Several Regexp: 12 wallclock secs (11.27 CPU) @  8873.11/s (n=100000)
    With index():   11 wallclock secs ( 8.71 CPU) @ 11481.06/s (n=100000)
    But the results you get running on your own data will be more useful.

      I like the results from cmpthese much better than those from timethese. Here is the same benchmark using cmpthese:
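
      (Assuming the test subs are gathered in a hash such as %tests, as in the sketch above, only the call changes:)

      use Benchmark qw(cmpthese);

      cmpthese( 100_000, \%tests );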

      Prints:

      bigre: (?-xism:\b(?:create|drop|delete|update|insert)\b)
      several_re: 5
      one_re: 5
      use_index: 5
                          Rate Several Regexp   With index() One Big Regexp
      Several Regexp   24634/s             --           -24%           -50%
      With index()     32367/s            31%             --           -34%
      One Big Regexp   49358/s           100%            52%             --

      DWIM is Perl's answer to Gödel
Re: speeding up a regex
by Perl Mouse (Chaplain) on Jan 03, 2006 at 15:45 UTC
    You have two nested loops, an outer loop extracting data from the database, and an inner loop iterating over the regexes. This means that each time perl encounters a regex, the regex is different from the previous one - which will cause the regex to be compiled again. You might want to try to loop over the regexes in the outer loop, and over the data in the inner loop.

    There's no guarantee that it will be faster - your regexes are already compiled, and it might be that re-visiting the database costs too much. It may even be slower.

    But it's worth a shot.

    Perl --((8:>*
Re: speeding up a regex
by Anonymous Monk on Jan 04, 2006 at 10:42 UTC
    Thanks for all the responses on this. Fantastic!