Surprisingly poor regex performance

sgifford has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
Re: Surprisingly poor regex performance by Joost (Canon) on Dec 13, 2004 at 16:52 UTC
How about anchoring the newline at the start? `while ($mmap =~ m/\n?(.$pat.\n)/omg) {` [download] Note that `.` is equivalent to `[^\n]` if you're not using /s. I added the * at the end, because it fits the /dict/words lookup. Removing the ? at the beginning makes it even faster, but then you need to handle the first line specially. "What should it profit a man, if he should win a flame war, yet lose his cool?"	[reply] [d/l] [select]
Re^2: Surprisingly poor regex performance by dragonchild (Archbishop) on Dec 13, 2004 at 16:55 UTC
What about `m/(?:\A\|\n)(.$pat.\n)/omg`? Being right, does not endow the right to be rude; politeness costs nothing. Being unknowing, is not the same as being stupid. Expressing a contrary opinion, whether to the individual or the group, is more often a sign of deeper thought than of cantankerous belligerence. Do not mistake your goals as the only goals; your opinion as the only opinion; your confidence as correctness. Saying you know better is not the same as explaining you know better.	[reply] [d/l]
Re^3: Surprisingly poor regex performance by Joost (Canon) on Dec 13, 2004 at 17:01 UTC
Good stuff.. It's faster :-) #!/usr/local/bin/perl -w use strict; use Benchmark; my $pat = 'foo'; $/=undef; open F,"<","/usr/share/dict/words"; my $mmap = <F>; close F; timethese( 100, { # 'orig' => sub { # while ($mmap =~ m/^([^\n]$pat[^\n]\n)/omg) { # warn $1; # } # }, 'fast' => sub { while ($mmap =~ m/\n?(.$pat.\n)/omg) { warn $1; } }, 'better' => sub { while ($mmap =~ m/(?:\A\|\n)(.$pat.\n)/omg) { warn $1; } }, } ) __END__ Benchmark: timing 100 iterations of better, fast... better: 26 wallclock secs (23.13 usr + 0.06 sys = 23.19 CPU) @ 4 +.31/s (n=100) fast: 41 wallclock secs (39.73 usr + 0.09 sys = 39.82 CPU) @ 2 +.51/s (n=100) [download] (original takes about 25 seconds per iteration on my machine) "What should it profit a man, if he should win a flame war, yet lose his cool?"*	[reply] [d/l]
Re^2: Surprisingly poor regex performance by sgifford (Prior) on Dec 13, 2004 at 17:12 UTC
Ah, thanks, that makes a big difference (though it's still slower than using `index`/`rindex`). Do you have any explanation for why? My understanding was that `^` meant pretty much the same thing as `\n?`. dragonchild's code is even faster, in some cases by a factor of 3.	[reply] [d/l] [select]
Re^3: Surprisingly poor regex performance by dragonchild (Archbishop) on Dec 13, 2004 at 17:59 UTC
Well, `^` doesn't mean the same as `\n?` at all. Like not even close. According to Programming Perl 3rd edition pages 150 and 159, `^`, when used with the `/m` modifier, means to match after embedded newlines or the beginning of the string. `\n?` means something akin to possibly match a newline, but maybe not. That doesn't provide the engine with any understanding of where to match. Had you said something like `/\n(.$pat.\n)/`, you would have been better off (except for not matching the first line). Of course, with /m, you should be able to do `/^(.$pat.)$/omsg` and it should be about as efficient as my regex. With regexes, it's always better to be as explicit as possible. This will allow the engine to make a number of optimizations. Some of those optimizations, as you have found out, can mean the difference between 2500 seconds and 23 seconds. Being right, does not endow the right to be rude; politeness costs nothing. Being unknowing, is not the same as being stupid. Expressing a contrary opinion, whether to the individual or the group, is more often a sign of deeper thought than of cantankerous belligerence. Do not mistake your goals as the only goals; your opinion as the only opinion; your confidence as correctness. Saying you know better is not the same as explaining you know better.	[reply] [d/l] [select]
Re^4: Surprisingly poor regex performance by sgifford (Prior) on Dec 13, 2004 at 23:30 UTC
Re^5: Surprisingly poor regex performance by dragonchild (Archbishop) on Dec 14, 2004 at 13:53 UTC
Some notes below your chosen depth have not been shown here
Re^3: Surprisingly poor regex performance by Joost (Canon) on Dec 13, 2004 at 17:22 UTC
Good question. I guess using ^ to match newlines is probably just not very well optimized in the engine. The usual trick to make regexes faster is to be as explicit as you can be (like spelling out the newlines) and it helped here. I don't know enough about perl's regex internals to really explain why it works as well as it does in this case. "What should it profit a man, if he should win a flame war, yet lose his cool?"	[reply]
Re: Surprisingly poor regex performance by ikegami (Patriarch) on Dec 13, 2004 at 16:47 UTC
Shouldn't `[^\n]\n` be `[^\n]*\n` I don't know if that'll help your performance problem. This post may not have anything to do with mmap. Did you try it without using mmap?	[reply] [d/l] [select]
Re^2: Surprisingly poor regex performance by sgifford (Prior) on Dec 13, 2004 at 16:57 UTC
Thanks; I tried several versions, and at some point lost my `*`. Putting it back in didn't help performance at all, though. And I agree it has nothing at all to do with `mmap`; that's just a convenient way of getting a 400K scalar variable with interesting contents.	[reply] [d/l] [select]
Re: Surprisingly poor regex performance by tall_man (Parson) on Dec 14, 2004 at 18:24 UTC
There is no need to do all this anchoring. "." does not match newlines (unless you use the /s option), so the following will also work (and should be just as fast): `my $re = qr/(.$pat.*)/; while ($mmap =~ m/$re/g) { print $1,"\n"; }` [download]	[reply] [d/l]
Re^2: Surprisingly poor regex performance by sgifford (Prior) on Dec 14, 2004 at 18:34 UTC
You're right, that is roughly as fast as `/(?:\A\|\n)(.$pat.\n)/omg`. But it's still slower than `index`/`rindex`.	[reply] [d/l] [select]
Re^3: Surprisingly poor regex performance by tall_man (Parson) on Dec 14, 2004 at 21:49 UTC
You were asking why the index/rindex is so good. It's because you're matching a pure literal string, which can use the highly-efficient Boyer-Moore search (the longer the literal string, the faster it tends to be). For some reason, the "^" is not being optimized properly as an anchor for "/mg" searches. Leaving it out entirely, or substituting the alternate `"(?:\A\|\n)"` pattern, results in searches that are much faster. That seems like a bug in the optimizer.	[reply] [d/l]
Re: Surprisingly poor regex performance by dws (Chancellor) on Dec 14, 2004 at 08:11 UTC
My question is: is there anything I could have told Perl to give it a hint about how to perform the match...? I'm wondering if a divide-and-conquer approach might give you the same effect, while being clearer to Perl about how to match without needing to backtrack. `while ( $mmap =~ /^(.*\n)/mg ) { my $line = $1; print $line if $line =~ /$pat/; }` [download]	[reply] [d/l]
Re: Surprisingly poor regex performance by dragonchild (Archbishop) on Dec 13, 2004 at 16:52 UTC
Well, are the two identical? From my reading, I don't think they are. The rewritten version finds "foo", then looks forward and backward to the first newline in both directions. The original version is looking for the beginning of a line after $pat ... ? That doesn't look right ... Update: It helps to have more than one cup of coffee on Monday mornings ... sighs /me really hates Mondays ... Being right, does not endow the right to be rude; politeness costs nothing. Being unknowing, is not the same as being stupid. Expressing a contrary opinion, whether to the individual or the group, is more often a sign of deeper thought than of cantankerous belligerence. Do not mistake your goals as the only goals; your opinion as the only opinion; your confidence as correctness. Saying you know better is not the same as explaining you know better.	[reply]
Would study help this regexp's performance? by Smylers (Pilgrim) on Dec 15, 2004 at 10:04 UTC
Would using `study` help here? Note this is only a question, not a recommendation — I've never used `study`, but from its docs this sounds like a situation where it might help. Smylers Retitled by davido. Re-retitled by Smylers, so that it's grammatically correct again (and I don't look like an idiot who doesn't know to use apostrophes).	[reply] [d/l] [select]
Re: Would study help this regexps performance? by BrowserUk (Patriarch) on Dec 15, 2004 at 10:41 UTC
study only really helps (ie. saves time) if you are going to be searching the studied string multiple times to offset the cost of the studying itself. And then only if your search term contains one or more characters that have rare occurance in the studied string. I've wondered whether study could be updated to take a parameter n, where it then builds the table from groups of n chars, triplets being more unique than doublets, and they more so than individual chars. Of course, the regex engine would need updating to make use of the information, and that's a very scary task to comtemplate. Examine what is said, not who speaks. The end of an era! "But you should never overestimate the ingenuity of the sceptics to come up with a counter-argument." -Myles Allen "Think for yourself!" - Abigail "Time is a poor substitute for thought"--theorbtwo "Efficiency is intelligent laziness." -David Dunham "Memory, processor, disk in that order on the hardware side. Algorithm, algorithm, algorithm on the code side." - tachyon	[reply]
Re^2: Would study help this regexps performance? by sgifford (Prior) on Dec 15, 2004 at 15:45 UTC
study only really helps (ie. saves time) if you are going to be searching the studied string multiple times to offset the cost of the studying itself And in this case, I'm searching the string only once, so it almost certainly would have hurt rather than helped.	[reply]
Re^3: Would study help this regexps performance? by BrowserUk (Patriarch) on Dec 15, 2004 at 16:24 UTC