
A discussion in the CB earlier today, with some monks whose identities I don't all recall (except for one high-ranking regular whose nick I won't bother with, as it's immaterial), raised the question of how to match either 'bar' or 'baz'. I piped up that, for maintainability reasons, I preferred /bar/ or /baz/. The regular replied that he prefers /bar|baz/ and said something about it being faster, which I thought was odd.

So I set about trying to Benchmark it, and I've been failing miserably to get anything reasonable. The only thing that I think I have is that /bar/ or /baz/ blows /bar|baz/ out of the water. But since I keep getting funny output, with one of the results being wildly inconsistent, I thought I'd put this to monks smarter than I, hopefully to show me the error of my ways. I did one more test prior to posting this, and I think I have orthogonal evidence that /bar/ or /baz/ actually is faster than /bar|baz/: use re 'debug' on both, and just see how many steps the regexp engine goes through. For the former, it goes through nearly no steps. For the latter, I lost track. That tells me that the former is faster for constant values. It's probably different if it's /$bar/, but one step at a time, please ;-)
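For anyone who wants to repeat the use re 'debug' comparison, here is a minimal sketch of how it could be set up. The short subject string and the variable names ($str, $or_result, $alt_result) are my own choices for illustration, not part of the benchmark script; the trace itself goes to STDERR, and its exact shape will likely differ between perl versions.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Short subject so the trace stays readable (the length is arbitrary).
my $str = ('b' x 5) . 'baz' . ('b' x 5);

my ($or_result, $alt_result);
{
    # 're' is a core pragma and 'debug' is lexically scoped, so only the
    # regexes inside this block are traced. The trace is written to STDERR.
    use re 'debug';

    # Form 1: two separate fixed-string matches joined with 'or'.
    $or_result  = ($str =~ /bar/ or $str =~ /baz/) ? 1 : 0;

    # Form 2: a single regex using alternation.
    $alt_result = ($str =~ /bar|baz/) ? 1 : 0;
}

# Both forms should agree on whether the string matched.
print "or: $or_result, alt: $alt_result\n";
```

Redirecting STDERR to a file (for example with 2>trace.txt in the shell) makes it easier to compare how much work each form reports.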

#!/usr/bin/perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

my $pre = 250;
my $post = 250;

my $baz_match = ('b' x $pre) . 'baz' . ('b' x $post);
my $bar_match = ('b' x $pre) . 'bar' . ('b' x $post);
my $no_match = ('b' x $pre) . 'bza' . ('b' x $post);

my $i;

cmpthese( 750000,
         {
             using_or_match =>    sub { $i += ($baz_match =~ /bar/ or $baz_match =~ /baz/) ? 1 : 0 },
             using_or_nomatch =>  sub { $i += ($no_match =~ /bar/  or $no_match =~ /baz/) ? 1 : 0 },
             using_alt_match =>   sub { $i += ($baz_match =~ /bar|baz/) ? 1 : 0 },
             using_alt_nomatch => sub { $i += ($no_match =~ /bar|baz/) ? 1 : 0 },
         });
print $i, $/;

I tried with various iteration counts: 50_000, 250_000, and, as above, 750_000. Adding up the successes in $i was partly to provide evidence that I was matching the right number of times, and partly because I was worried that the perl compiler was optimising away the using_or_* subs. Even then, I suspect that it is optimising something away. For 250_000, I get numbers like this:

            (warning: too few iterations for a reliable count)
            (warning: too few iterations for a reliable count)
                        Rate using_alt_nomatch using_alt_match using_or_nomatch using_or_match
using_alt_nomatch   125628/s                --            -53%             -98%           -99%
using_alt_match     265957/s              112%              --             -97%           -98%
using_or_nomatch   8333333/s             6533%           3033%               --           -33%
using_or_match    12500000/s             9850%           4600%              50%             --
500000

There just has to be something wrong with that using_or_match. When I bump it up to 750_000, that number just gets more ridiculous (although it gets rid of one of the warnings). However, I'm not entirely sure how to counteract it.
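One thing that might be worth trying for the "too few iterations" warning is giving Benchmark a negative count, which it documents as a minimum number of CPU seconds per sub rather than a fixed number of iterations. The sketch below is only that: the helper subs or_form and alt_form are names I made up for a sanity check outside the timed region, and the 1-second budget is an arbitrary choice to keep the run short.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

my ($pre, $post) = (250, 250);
my $baz_match = ('b' x $pre) . 'baz' . ('b' x $post);
my $no_match  = ('b' x $pre) . 'bza' . ('b' x $post);

# Hypothetical helpers, used only to check correctness before timing.
sub or_form  { my ($s) = @_; return ($s =~ /bar/ or $s =~ /baz/) ? 1 : 0 }
sub alt_form { my ($s) = @_; return ($s =~ /bar|baz/) ? 1 : 0 }

# Verify both forms agree, so no success counter is needed in the timed subs.
die "forms disagree on the matching string\n"
    unless or_form($baz_match) == 1 && alt_form($baz_match) == 1;
die "forms disagree on the non-matching string\n"
    unless or_form($no_match) == 0 && alt_form($no_match) == 0;

# A negative count means "run each sub for at least this many CPU seconds",
# so very fast subs get more iterations instead of triggering the warning.
cmpthese( -1,
         {
             using_or_match    => sub { ($baz_match =~ /bar/ or $baz_match =~ /baz/) ? 1 : 0 },
             using_or_nomatch  => sub { ($no_match =~ /bar/ or $no_match =~ /baz/) ? 1 : 0 },
             using_alt_match   => sub { ($baz_match =~ /bar|baz/) ? 1 : 0 },
             using_alt_nomatch => sub { ($no_match =~ /bar|baz/) ? 1 : 0 },
         });
```

The regexes are kept inline in the timed subs, as in the script above, so that the cost of an extra sub call doesn't dilute whatever difference there is. A larger budget such as -3 should give steadier rates at the cost of a longer run.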

Now, I realise that my overall tests are probably suspect, but having run into this, without knowing how to get around it, I'm not even sure how to proceed benchmarking the difference.

In reply to Benchmarking regex alternation by Tanktalus
