in reply to speeding up a regex

How would I do this? E.g., assume my keyword input file would look like this

Something along these lines:

my @patterns = map { chomp; qr/\b\Q$_\E\b/ } <$filehandle>;

Is there a more efficient way of doing the regex?

There are various trade-offs; as long as you have few and simple patterns, combining them is more likely to slow things down than speed them up, because a combined pattern will cause the engine to waste a lot of effort backtracking to check alternatives. study has never been of any use to me, either. If you have a lot of patterns, you may want to look into Regexp::Assemble as a way of combining them.
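To make the trade-off concrete, here is a minimal sketch using only core Perl. The keyword list and the helper names `match_any_separate` and `match_any_combined` are made up for illustration, and which of the two is faster depends on your data, so benchmark before committing to either.

```perl
use strict;
use warnings;

# Hypothetical keyword list; in practice this would be read from the file.
my @keywords = qw( create drop delete update insert );

# Approach 1: one precompiled pattern per keyword, tried in turn.
my @patterns = map { qr/\b\Q$_\E\b/ } @keywords;

sub match_any_separate {
    my ($text) = @_;
    for my $re (@patterns) {
        return 1 if $text =~ $re;
    }
    return 0;
}

# Approach 2: a single alternation built from the same keywords.
my $alternation = join '|', map { quotemeta($_) } @keywords;
my $combined    = qr/\b(?:$alternation)\b/;

sub match_any_combined {
    my ($text) = @_;
    return $text =~ $combined ? 1 : 0;
}
```

Both subs should return the same answer for any input; only the amount of work the regex engine does differs.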

Makeshifts last the longest.

Re^2: speeding up a regex
by Anonymous Monk on Jan 03, 2006 at 15:25 UTC
    Out of interest, how does using a variable in a regular expression affect Perl? I assume Perl recompiles the pattern each time through the loop because the variable may have changed?
    So by moving the keyword patterns out to a file, am I introducing more of a performance degradation?
    Obviously I can test this out but it would be good to hear others views on this.
    Many thanks

      In modern Perls -- I'm not sure which versions qualify here, maybe 5.6+ -- Perl will check whether the contents of the variable have changed. If they have not changed, the regexp is not recompiled.

      For example, compare

      my @words = ( 'foo', 'bar', );
      foreach (@array) {
          foreach my $word (@words) {
              if (/\Q$word/) {   # BAD!! $word always changes.
                  print;
                  last;
              }
          }
      }

      to

      my @words = ( 'foo', 'bar', );
      foreach my $word (@words) {
          foreach (@array) {
              if (/\Q$word/) {   # GOOD!! regexp only recompiled when needed.
                  print;
                  last;
              }
          }
      }

      It's not always practical to change the order of the loops, for example when one of them reads from a file. In that case, the solution is to precompile the regexps. For example, compare

      my @words = ( 'foo', 'bar', );
      while (<FILE>) {
          foreach my $word (@words) {
              if (/\Q$word/) {   # BAD!! $word always changes.
                  print;
                  last;
              }
          }
      }

      to

      my @words = ( 'foo', 'bar', );

      # Precompile the regexps.
      my @regexps = map { qr/\Q$_/ } @words;

      while (<FILE>) {
          foreach my $regexp (@regexps) {
              if (/$regexp/) {          # GOOD!! $regexp is a compiled regexp.
              #if ($_ =~ $regexp) {     # GOOD!! Alternate syntax.
                  print;
                  last;
              }
          }
      }

      If you're trying to match constant strings rather than regexps, then I recommend Regexp::List:

      use Regexp::List ();

      my @words = ( 'foo', 'bar', );
      my $regexp = Regexp::List->new()->list2re(@words);

      while (<FILE>) {
          print if /$regexp/;
          #print if $_ =~ $regexp;   # Alternate syntax.
      }

      By the way,
      for ($loop_index = 0; $loop_index < $#patterns; $loop_index++) {
      is much less readable and no more efficient than
      for my $loop_index (0..$#patterns) {
      You could also have used
      foreach (@patterns) {

      Finally, in your case, I'd use

      my @patterns = qw( create drop delete update insert );
      my $regexp;
      $regexp = Regexp::List->new(modifiers => 'i')->list2re(@patterns);
      $regexp = qr/\b(?:$regexp)\b/;
      ...
      while ($data = $sth->fetchrow_arrayref()) {
          # index is faster than regexps on constant strings.
          next if index(lc($data->[10]), 'tempdb') >= 0;
          if ($data->[13] =~ $regexp) {
              print "$data->[3] $data->[9] $data->[10] $data->[13]\n";
              last;
          }
      }

      Update: Bug Fix: Changed $word to $_ in map's code block.

        In modern Perls -- I'm not sure which versions qualify here, maybe 5.6+ -- Perl will check whether the contents of the variable have changed. If they have not changed, the regexp is not recompiled.
        ...
        foreach my $word (@words) {
            if (/\Q$word/) {   # BAD!! $word always changes.
        Hi ikegami,

        Thanks for the very informative post! A quick question: I thought that in a loop like the one above, the iterator variable ($word) was temporarily aliased to each value of the array. So I would expect that as long as the array's contents didn't change, perl would know not to recompile the RE, and so both of your above examples would be the same speed.

        But a quick Benchmark agrees with you: putting the RE in the outer loop is about twice as fast as in the inner loop.

        Any hints as to what's wrong with my understanding of variable aliasing or RE caching?

        Thanks!
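For anyone who wants to repeat the measurement, here is a minimal sketch using the core Benchmark module. The test data and the sub names are invented for illustration, and every variant counts all matching (word, line) pairs so that the results are directly comparable. One plausible reading of the result, consistent with the explanation further down the thread, is that perl compares the interpolated pattern text with what that match operator compiled last, so aliasing does not help when the text alternates between foo and bar.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

my @words = qw( foo bar );

# Made-up test data: some lines contain "foo", some "bar", most neither.
my @lines;
for my $n (1 .. 1000) {
    my $tag = $n % 10 == 0 ? 'foo' : $n % 7 == 0 ? 'bar' : 'plain';
    push @lines, "line $n $tag";
}

# Word loop on the inside: the interpolated text changes on every match.
sub inner_loop {
    my $hits = 0;
    for my $line (@lines) {
        for my $word (@words) {
            $hits++ if $line =~ /\Q$word/;
        }
    }
    return $hits;
}

# Word loop on the outside: the interpolated text changes once per word.
sub outer_loop {
    my $hits = 0;
    for my $word (@words) {
        for my $line (@lines) {
            $hits++ if $line =~ /\Q$word/;
        }
    }
    return $hits;
}

# Word loop on the inside again, but with patterns precompiled via qr//.
my @regexps = map { qr/\Q$_/ } @words;

sub precompiled {
    my $hits = 0;
    for my $line (@lines) {
        for my $regexp (@regexps) {
            $hits++ if $line =~ $regexp;
        }
    }
    return $hits;
}

# Only run the timing when asked to, e.g. RUN_BENCH=1 perl bench.pl
if ($ENV{RUN_BENCH}) {
    cmpthese(-1, {
        inner_loop  => \&inner_loop,
        outer_loop  => \&outer_loop,
        precompiled => \&precompiled,
    });
}
```

The actual ratios will vary with the perl version, the patterns and the data, so treat any single run as indicative only.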

        Every once in a while, I see a post like this and wish I could ++ it more than once. For newcomers to a language, seeing "badform/goodform" examples is really a great way to understand the benefits of different approaches.

        --
        [ e d @ h a l l e y . c c ]

      That depends on a whole lot of factors. In old perls, that was true; modern perls have optimisations to avoid recompiling patterns if the interpolated variables haven’t changed.

      However, note that a match such as $foo =~ /$bar/ is magical if $bar is a pattern precompiled with qr, in that the pattern compilation phase is skipped completely. (This applies when the pattern consists of nothing but the interpolated variable.)
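A minimal sketch of the two cases, using only core Perl. The sub names are hypothetical, and the comments describe the behaviour as explained above rather than anything measured here.

```perl
use strict;
use warnings;

my $bar = qr/\bfoo\b/i;    # compiled once, here

# The pattern is nothing but the qr object, so the compiled regexp
# can be reused as-is. Equivalent to: $foo =~ $bar
sub match_alone {
    my ($foo) = @_;
    return $foo =~ /$bar/ ? 1 : 0;
}

# The pattern contains other text as well, so the qr object is
# interpolated into a larger pattern that has to be compiled itself.
sub match_embedded {
    my ($foo) = @_;
    return $foo =~ /^\s*$bar/ ? 1 : 0;
}
```

In the second case the usual caching still applies: the combined pattern is only recompiled if the interpolated text changes.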

      And finally, the (??{}) delayed evaluation construct lets you embed pre-compiled patterns in another pattern without recompiling the embedded pattern.
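As a rough illustration, here is a sketch of embedding a precompiled pattern through the postponed-subexpression construct, written `(??{ ... })` in perlre. The variable and sub names are made up, and the example assumes a reasonably recent perl (5.18 or later) where code blocks in patterns close over lexicals reliably.

```perl
use strict;
use warnings;

my $number = qr/\d+/;    # precompiled sub-pattern

# (??{ ... }) runs the code at match time and uses its result as a
# pattern. Returning a qr object hands over an already compiled regexp.
my $assignment = qr/^(\w+)=(??{ $number })\z/;

sub is_numeric_assignment {
    my ($text) = @_;
    return $text =~ $assignment ? 1 : 0;
}
```

The construct has a run-time cost of its own, so it is worth measuring against plain interpolation before relying on it for speed.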

      Makeshifts last the longest.

        modern perls have optimisations to avoid recompiling patterns if the interpolated variables haven’t changed
        Modern perls are even smarter than that. They look at the string after interpolation, and if that's the same, the regex isn't recompiled:
        $ cat xx
        #!/usr/bin/perl
        use strict;
        use warnings;
        foreach my $x (["foo", "bar"], ["fo", "obar"]) {
            my ($foo, $bar) = @$x;
            "" =~ /$foo/;
            "" =~ /$foo$bar/;
        }
        __END__
        $ perl -Dr xx 2>&1 | perl -nle 'if (/EXECUTING/ .. eof) {print if/^Compiling/}'
        Compiling REx `foo'
        Compiling REx `foobar'
        Compiling REx `fo'
        Perl won't recompile the second regex, since while the values of both $foo and $bar have changed, the value of "$foo$bar" hasn't.
        Perl --((8:>*