Tanktalus has asked for the wisdom of the Perl Monks concerning the following question:
A discussion on CB earlier today with some monks whose identities I don't all recall (except for one high-ranking regular whose nick I won't bother with as it's immaterial) had the question: how to match /bar/ or /baz/. I piped up that, for maintainability reasons, I preferred /bar/ or /baz/. The regular piped up that he prefers /bar|baz/ - and said something about being faster. Which I thought was odd.
So I went about trying to Benchmark it. And I've been failing miserably to get anything reasonable. The only thing that I think I have is that /bar/ or /baz/ blows /bar|baz/ out of the water. But since I keep getting funny output, with one of them being wildly inconsistant, I thought I'd posit this to smarter monks than I, hopefully to show me the err of my ways. I did one more test prior to posting this, and I think I have orthogonal evidence that /bar/ or /baz/ actually is faster than /bar|baz/: use re 'debug' on both, and just see how many steps the regexp engine goes through. For the former, it goes through nearly no steps. For the latter, I lost track. That tells me that the former is faster for constant values. Probably different if it's /$bar/, but one step at a time, please ;-)
#!/usr/bin/perl use strict; use warnings; use Benchmark qw(cmpthese);
my $pre = 250; my $post = 250;
my $baz_match = ('b' x $pre) . 'baz' . ('b' x $post); my $bar_match = ('b' x $pre) . 'bar' . ('b' x $post); my $no_match = ('b' x $pre) . 'bza' . ('b' x $post);
my $i;
cmpthese( 750000, { using_or_match => sub { $i += ($baz_match =~ /bar/ or +$baz_match =~ /baz/) ? 1 : 0 }, using_or_nomatch => sub { $i += ($no_match =~ /bar/ or +$no_match =~ /baz/) ? 1 : 0 }, using_alt_match => sub { $i += ($baz_match =~ /bar|baz/ +) ? 1 : 0 }, using_alt_nomatch => sub { $i += ($no_match =~ /bar|baz/) + ? 1 : 0 }, }); print $i, $/;
I tried with various numbers: 50_000, 250_000, and, as above, 750_000. The addition of successes was partly to provide evidence that I was matching the right number of times, and partly because I was worried that the perl compiler was optimising away the using_or_* methods. Even then, I suspect that it is optimising something away. For 250_000, I get numbers like this:
(warning: too few iterations for a reliable count) (warning: too few iterations for a reliable count) Rate using_alt_nomatch using_alt_match using_o +r_nomatch using_or_match using_alt_nomatch 125628/s -- -53% + -98% -99% using_alt_match 265957/s 112% -- + -97% -98% using_or_nomatch 8333333/s 6533% 3033% + -- -33% using_or_match 12500000/s 9850% 4600% + 50% -- 500000
There just has to be something wrong with that using_or_match. When I bump it up to 750_000, that number just gets more ridiculous (although it gets rid of one of the warnings). However, I'm not entirely sure how to counteract it.
Now, I realise that my overall tests are probably suspect, but having run into this, without knowing how to get around it, I'm not even sure how to proceed benchmarking the difference.
|
|---|
| Replies are listed 'Best First'. | |
|---|---|
|
Re: Benchmarking regex alternation
by diotalevi (Canon) on Jan 30, 2007 at 03:13 UTC | |
by demerphq (Chancellor) on Jan 30, 2007 at 15:58 UTC | |
by diotalevi (Canon) on Jan 30, 2007 at 16:10 UTC | |
|
Re: Benchmarking regex alternation
by GrandFather (Saint) on Jan 30, 2007 at 01:47 UTC | |
by demerphq (Chancellor) on Jan 30, 2007 at 14:37 UTC | |
by sgt (Deacon) on Jan 30, 2007 at 17:27 UTC |