Re: Surprisingly poor regex performance
by Joost (Canon) on Dec 13, 2004 at 16:52 UTC
|
How about anchoring the newline at the start?
while ($mmap =~ m/\n?(.*$pat.*\n)/omg) {
Note that .* is equivalent to [^\n]* if you're not using /s. I added the * at the end, because it fits the /dict/words lookup.
Removing the ? at the beginning makes it even faster, but then you need to handle the first line specially.
| [reply] [d/l] [select] |
|
|
What about m/(?:\A|\n)(.*$pat.*\n)/omg?
Being right, does not endow the right to be rude; politeness costs nothing. Being unknowing, is not the same as being stupid. Expressing a contrary opinion, whether to the individual or the group, is more often a sign of deeper thought than of cantankerous belligerence. Do not mistake your goals as the only goals; your opinion as the only opinion; your confidence as correctness. Saying you know better is not the same as explaining you know better.
| [reply] [d/l] |
|
|
Good stuff.. It's faster :-)
#!/usr/local/bin/perl -w
use strict;
use Benchmark;
my $pat = 'foo';
$/=undef;
open F,"<","/usr/share/dict/words";
my $mmap = <F>;
close F;
timethese( 100, {
# 'orig' => sub {
# while ($mmap =~ m/^([^\n]*$pat[^\n]\n)/omg) {
# warn $1;
# }
# },
'fast' => sub {
while ($mmap =~ m/\n?(.*$pat.*\n)/omg) {
warn $1;
}
},
'better' => sub {
while ($mmap =~ m/(?:\A|\n)(.*$pat.*\n)/omg) {
warn $1;
}
},
}
)
__END__
Benchmark: timing 100 iterations of better, fast...
better: 26 wallclock secs (23.13 usr + 0.06 sys = 23.19 CPU) @ 4
+.31/s (n=100)
fast: 41 wallclock secs (39.73 usr + 0.09 sys = 39.82 CPU) @ 2
+.51/s (n=100)
(original takes about 25 seconds per iteration on my machine)
| [reply] [d/l] |
|
|
| [reply] [d/l] [select] |
|
|
Well, ^ doesn't mean the same as \n? at all. Like not even close. According to Programming Perl 3rd edition pages 150 and 159, ^, when used with the /m modifier, means to match after embedded newlines or the beginning of the string.
\n? means something akin to possibly match a newline, but maybe not. That doesn't provide the engine with any understanding of where to match. Had you said something like /\n(.*$pat.*\n)/, you would have been better off (except for not matching the first line). Of course, with /m, you should be able to do /^(.*$pat.*)$/omsg and it should be about as efficient as my regex.
With regexes, it's always better to be as explicit as possible. This will allow the engine to make a number of optimizations. Some of those optimizations, as you have found out, can mean the difference between 2500 seconds and 23 seconds.
Being right, does not endow the right to be rude; politeness costs nothing. Being unknowing, is not the same as being stupid. Expressing a contrary opinion, whether to the individual or the group, is more often a sign of deeper thought than of cantankerous belligerence. Do not mistake your goals as the only goals; your opinion as the only opinion; your confidence as correctness. Saying you know better is not the same as explaining you know better.
| [reply] [d/l] [select] |
|
|
|
|
|
|
|
| [reply] |
Re: Surprisingly poor regex performance
by ikegami (Patriarch) on Dec 13, 2004 at 16:47 UTC
|
Shouldn't
[^\n]\n
be
[^\n]*\n
I don't know if that'll help your performance problem.
This post may not have anything to do with mmap. Did you try it without using mmap?
| [reply] [d/l] [select] |
|
|
Thanks; I tried several versions, and at some point lost my *. Putting it back in didn't help performance at all, though. And I agree it has nothing at all to do with mmap; that's just a convenient way of getting a 400K scalar variable with interesting contents.
| [reply] [d/l] [select] |
Re: Surprisingly poor regex performance
by tall_man (Parson) on Dec 14, 2004 at 18:24 UTC
|
There is no need to do all this anchoring. ".*" does not match newlines (unless you use the /s option), so the following will also work (and should be just as fast):
my $re = qr/(.*$pat.*)/;
while ($mmap =~ m/$re/g)
{
print $1,"\n";
}
| [reply] [d/l] |
|
|
You're right, that is roughly as fast as /(?:\A|\n)(.*$pat.*\n)/omg. But it's still slower than index/rindex.
| [reply] [d/l] [select] |
|
|
You were asking why the index/rindex is so good. It's because you're matching a pure literal string, which can use the highly-efficient Boyer-Moore search (the longer the literal string, the faster it tends to be).
For some reason, the "^" is not being optimized properly as an anchor for "/mg" searches. Leaving it out entirely, or substituting the alternate "(?:\A|\n)" pattern, results in searches that are much faster. That seems like a bug in the optimizer.
| [reply] [d/l] |
Re: Surprisingly poor regex performance
by dws (Chancellor) on Dec 14, 2004 at 08:11 UTC
|
My question is: is there anything I could have told Perl to give it a hint about how to perform the match...?
I'm wondering if a divide-and-conquer approach might give you the same effect, while being clearer to Perl about how to match without needing to backtrack.
while ( $mmap =~ /^(.*\n)/mg ) {
my $line = $1;
print $line if $line =~ /$pat/;
}
| [reply] [d/l] |
Re: Surprisingly poor regex performance
by dragonchild (Archbishop) on Dec 13, 2004 at 16:52 UTC
|
Well, are the two identical? From my reading, I don't think they are. The rewritten version finds "foo", then looks forward and backward to the first newline in both directions. The original version is looking for the beginning of a line after $pat ... ? That doesn't look right ...
Update: It helps to have more than one cup of coffee on Monday mornings ... *sighs*
/me really hates Mondays ...
Being right, does not endow the right to be rude; politeness costs nothing. Being unknowing, is not the same as being stupid. Expressing a contrary opinion, whether to the individual or the group, is more often a sign of deeper thought than of cantankerous belligerence. Do not mistake your goals as the only goals; your opinion as the only opinion; your confidence as correctness. Saying you know better is not the same as explaining you know better.
| [reply] |
Would study help this regexp's performance?
by Smylers (Pilgrim) on Dec 15, 2004 at 10:04 UTC
|
Would using study help here?
Note this is only a question, not a recommendation — I've never used study, but from its docs this sounds like a situation where it might help.
Smylers
Re-retitled by Smylers, so that it's grammatically correct again (and I don't look like an idiot who doesn't know to use apostrophes).
| [reply] [d/l] [select] |
|
|
study only really helps (ie. saves time) if you are going to be searching the studied string multiple times to offset the cost of the studying itself. And then only if your search term contains one or more characters that have rare occurance in the studied string.
I've wondered whether study could be updated to take a parameter n, where it then builds the table from groups of n chars, triplets being more unique than doublets, and they more so than individual chars.
Of course, the regex engine would need updating to make use of the information, and that's a very scary task to comtemplate.
"But you should never overestimate the ingenuity of the sceptics to come up with a counter-argument." -Myles Allen
"Think for yourself!" - Abigail
"Time is a poor substitute for thought"--theorbtwo
"Efficiency is intelligent laziness." -David Dunham
"Memory, processor, disk in that order on the hardware side. Algorithm, algorithm, algorithm on the code side." - tachyon
| [reply] |
|
|
| [reply] |
|
|