Re: Surprisingly poor regex performance

Replies are listed 'Best First'.
Re^2: Surprisingly poor regex performance by dragonchild (Archbishop) on Dec 13, 2004 at 16:55 UTC
What about `m/(?:\A\|\n)(.$pat.\n)/omg`? Being right, does not endow the right to be rude; politeness costs nothing. Being unknowing, is not the same as being stupid. Expressing a contrary opinion, whether to the individual or the group, is more often a sign of deeper thought than of cantankerous belligerence. Do not mistake your goals as the only goals; your opinion as the only opinion; your confidence as correctness. Saying you know better is not the same as explaining you know better.	[reply] [d/l]
Re^3: Surprisingly poor regex performance by Joost (Canon) on Dec 13, 2004 at 17:01 UTC
Good stuff.. It's faster :-) #!/usr/local/bin/perl -w use strict; use Benchmark; my $pat = 'foo'; $/=undef; open F,"<","/usr/share/dict/words"; my $mmap = <F>; close F; timethese( 100, { # 'orig' => sub { # while ($mmap =~ m/^([^\n]$pat[^\n]\n)/omg) { # warn $1; # } # }, 'fast' => sub { while ($mmap =~ m/\n?(.$pat.\n)/omg) { warn $1; } }, 'better' => sub { while ($mmap =~ m/(?:\A\|\n)(.$pat.\n)/omg) { warn $1; } }, } ) __END__ Benchmark: timing 100 iterations of better, fast... better: 26 wallclock secs (23.13 usr + 0.06 sys = 23.19 CPU) @ 4 +.31/s (n=100) fast: 41 wallclock secs (39.73 usr + 0.09 sys = 39.82 CPU) @ 2 +.51/s (n=100) [download] (original takes about 25 seconds per iteration on my machine) "What should it profit a man, if he should win a flame war, yet lose his cool?"*	[reply] [d/l]
Re^2: Surprisingly poor regex performance by sgifford (Prior) on Dec 13, 2004 at 17:12 UTC
Ah, thanks, that makes a big difference (though it's still slower than using `index`/`rindex`). Do you have any explanation for why? My understanding was that `^` meant pretty much the same thing as `\n?`. dragonchild's code is even faster, in some cases by a factor of 3.	[reply] [d/l] [select]
Re^3: Surprisingly poor regex performance by dragonchild (Archbishop) on Dec 13, 2004 at 17:59 UTC
Well, `^` doesn't mean the same as `\n?` at all. Like not even close. According to Programming Perl 3rd edition pages 150 and 159, `^`, when used with the `/m` modifier, means to match after embedded newlines or the beginning of the string. `\n?` means something akin to possibly match a newline, but maybe not. That doesn't provide the engine with any understanding of where to match. Had you said something like `/\n(.$pat.\n)/`, you would have been better off (except for not matching the first line). Of course, with /m, you should be able to do `/^(.$pat.)$/omsg` and it should be about as efficient as my regex. With regexes, it's always better to be as explicit as possible. This will allow the engine to make a number of optimizations. Some of those optimizations, as you have found out, can mean the difference between 2500 seconds and 23 seconds. Being right, does not endow the right to be rude; politeness costs nothing. Being unknowing, is not the same as being stupid. Expressing a contrary opinion, whether to the individual or the group, is more often a sign of deeper thought than of cantankerous belligerence. Do not mistake your goals as the only goals; your opinion as the only opinion; your confidence as correctness. Saying you know better is not the same as explaining you know better.	[reply] [d/l] [select]
Re^4: Surprisingly poor regex performance by sgifford (Prior) on Dec 13, 2004 at 23:30 UTC
You're right of course; I shouldn't have said meant the same as; I meant would have the same effect as. Still, I think it's accurate to say that `^` means very nearly the same as your example: `(?:\A\|\n)`, so it's still quite surprising to me that yours is so much faster.	[reply] [d/l] [select]
Re^5: Surprisingly poor regex performance by dragonchild (Archbishop) on Dec 14, 2004 at 13:53 UTC
Re^6: Surprisingly poor regex performance (apples) by tye (Sage) on Dec 14, 2004 at 17:03 UTC
Some notes below your chosen depth have not been shown here
Re^3: Surprisingly poor regex performance by Joost (Canon) on Dec 13, 2004 at 17:22 UTC
Good question. I guess using ^ to match newlines is probably just not very well optimized in the engine. The usual trick to make regexes faster is to be as explicit as you can be (like spelling out the newlines) and it helped here. I don't know enough about perl's regex internals to really explain why it works as well as it does in this case. "What should it profit a man, if he should win a flame war, yet lose his cool?"	[reply]