A discussion on CB earlier today with some monks whose identities I don't all recall (except for one high-ranking regular whose nick I won't bother with as it's immaterial) had the question: how to match /bar/ or /baz/. I piped up that, for maintainability reasons, I preferred /bar/ or /baz/. The regular piped up that he prefers /bar|baz/ - and said something about being faster. Which I thought was odd.
So I went about trying to Benchmark it. And I've been failing miserably to get anything reasonable. The only thing that I think I have is that /bar/ or /baz/ blows /bar|baz/ out of the water. But since I keep getting funny output, with one of them being wildly inconsistant, I thought I'd posit this to smarter monks than I, hopefully to show me the err of my ways. I did one more test prior to posting this, and I think I have orthogonal evidence that /bar/ or /baz/ actually is faster than /bar|baz/: use re 'debug' on both, and just see how many steps the regexp engine goes through. For the former, it goes through nearly no steps. For the latter, I lost track. That tells me that the former is faster for constant values. Probably different if it's /$bar/, but one step at a time, please ;-)
#!/usr/bin/perl use strict; use warnings; use Benchmark qw(cmpthese);
my $pre = 250; my $post = 250;
my $baz_match = ('b' x $pre) . 'baz' . ('b' x $post); my $bar_match = ('b' x $pre) . 'bar' . ('b' x $post); my $no_match = ('b' x $pre) . 'bza' . ('b' x $post);
my $i;
cmpthese( 750000, { using_or_match => sub { $i += ($baz_match =~ /bar/ or +$baz_match =~ /baz/) ? 1 : 0 }, using_or_nomatch => sub { $i += ($no_match =~ /bar/ or +$no_match =~ /baz/) ? 1 : 0 }, using_alt_match => sub { $i += ($baz_match =~ /bar|baz/ +) ? 1 : 0 }, using_alt_nomatch => sub { $i += ($no_match =~ /bar|baz/) + ? 1 : 0 }, }); print $i, $/;
I tried with various numbers: 50_000, 250_000, and, as above, 750_000. The addition of successes was partly to provide evidence that I was matching the right number of times, and partly because I was worried that the perl compiler was optimising away the using_or_* methods. Even then, I suspect that it is optimising something away. For 250_000, I get numbers like this:
(warning: too few iterations for a reliable count) (warning: too few iterations for a reliable count) Rate using_alt_nomatch using_alt_match using_o +r_nomatch using_or_match using_alt_nomatch 125628/s -- -53% + -98% -99% using_alt_match 265957/s 112% -- + -97% -98% using_or_nomatch 8333333/s 6533% 3033% + -- -33% using_or_match 12500000/s 9850% 4600% + 50% -- 500000
There just has to be something wrong with that using_or_match. When I bump it up to 750_000, that number just gets more ridiculous (although it gets rid of one of the warnings). However, I'm not entirely sure how to counteract it.
Now, I realise that my overall tests are probably suspect, but having run into this, without knowing how to get around it, I'm not even sure how to proceed benchmarking the difference.
In reply to Benchmarking regex alternation by Tanktalus
| For: | Use: | ||
| & | & | ||
| < | < | ||
| > | > | ||
| [ | [ | ||
| ] | ] |