in reply to Surprisingly poor regex performance

How about anchoring the newline at the start?
while ($mmap =~ m/\n?(.*$pat.*\n)/omg) {
Note that .* is equivalent to [^\n]* if you're not using /s. I added the * at the end, because it fits the /dict/words lookup.

Removing the ? at the beginning makes it even faster, but then you need to handle the first line specially.

Replies are listed 'Best First'.
Re^2: Surprisingly poor regex performance
by dragonchild (Archbishop) on Dec 13, 2004 at 16:55 UTC
    What about m/(?:\A|\n)(.*$pat.*\n)/omg?

    Being right, does not endow the right to be rude; politeness costs nothing.
    Being unknowing, is not the same as being stupid.
    Expressing a contrary opinion, whether to the individual or the group, is more often a sign of deeper thought than of cantankerous belligerence.
    Do not mistake your goals as the only goals; your opinion as the only opinion; your confidence as correctness. Saying you know better is not the same as explaining you know better.

      Good stuff.. It's faster :-)

      #!/usr/local/bin/perl -w use strict; use Benchmark; my $pat = 'foo'; $/=undef; open F,"<","/usr/share/dict/words"; my $mmap = <F>; close F; timethese( 100, { # 'orig' => sub { # while ($mmap =~ m/^([^\n]*$pat[^\n]\n)/omg) { # warn $1; # } # }, 'fast' => sub { while ($mmap =~ m/\n?(.*$pat.*\n)/omg) { warn $1; } }, 'better' => sub { while ($mmap =~ m/(?:\A|\n)(.*$pat.*\n)/omg) { warn $1; } }, } ) __END__ Benchmark: timing 100 iterations of better, fast... better: 26 wallclock secs (23.13 usr + 0.06 sys = 23.19 CPU) @ 4 +.31/s (n=100) fast: 41 wallclock secs (39.73 usr + 0.09 sys = 39.82 CPU) @ 2 +.51/s (n=100)
      (original takes about 25 seconds per iteration on my machine)

Re^2: Surprisingly poor regex performance
by sgifford (Prior) on Dec 13, 2004 at 17:12 UTC
    Ah, thanks, that makes a big difference (though it's still slower than using index/rindex). Do you have any explanation for why? My understanding was that ^ meant pretty much the same thing as \n?.

    dragonchild's code is even faster, in some cases by a factor of 3.

      Well, ^ doesn't mean the same as \n? at all. Like not even close.

      According to Programming Perl 3rd edition pages 150 and 159, ^, when used with the /m modifier, means to match after embedded newlines or the beginning of the string.

      \n? means something akin to possibly match a newline, but maybe not. That doesn't provide the engine with any understanding of where to match. Had you said something like /\n(.*$pat.*\n)/, you would have been better off (except for not matching the first line). Of course, with /m, you should be able to do /^(.*$pat.*)$/omsg and it should be about as efficient as my regex.

      With regexes, it's always better to be as explicit as possible. This will allow the engine to make a number of optimizations. Some of those optimizations, as you have found out, can mean the difference between 2500 seconds and 23 seconds.

      Being right, does not endow the right to be rude; politeness costs nothing.
      Being unknowing, is not the same as being stupid.
      Expressing a contrary opinion, whether to the individual or the group, is more often a sign of deeper thought than of cantankerous belligerence.
      Do not mistake your goals as the only goals; your opinion as the only opinion; your confidence as correctness. Saying you know better is not the same as explaining you know better.

        You're right of course; I shouldn't have said meant the same as; I meant would have the same effect as.

        Still, I think it's accurate to say that ^ means very nearly the same as your example: (?:\A|\n), so it's still quite surprising to me that yours is so much faster.

      Good question. I guess using ^ to match newlines is probably just not very well optimized in the engine.

      The usual trick to make regexes faster is to be as explicit as you can be (like spelling out the newlines) and it helped here. I don't know enough about perl's regex internals to really explain why it works as well as it does in this case.