comment on

Heh, a crash course in benchmarking eh? ;-)

Well personally my analysis of why the || version is faster is this (and i await correction by japhy or some other god...)

Lets say the string we are matching against is "test100"

so with /(est|\d)/ it would first check 'e' against 't' fail, and then test 'e' to is if its in [0-9] (its not smart enough to know it isn't unfortunately) which it isnt of course, (also this test is much more expensive than testing against a constant char). It would then test 'e' against 'e' and (so on and) find the match. So the result is 4 constant char tests and 1 char class test, we also have some minor overhead doing the option transition.

With /est/ it would test the 'e' against the 't' fail the match and then match as before. Result 4 constant char tests, no char class tests, no option transitions. (And this ignores potential optimizations that the boyer-moore algorithm may or may not contribute to this case). Since the first match succeeds it the second regex never executes and no char class matching is performed.

I reversed the order of the text so that instead of being test100 it would be 100test (as well as adding reverse versions of the tests), and even then your "faster" version _is_ actually faster. This makes me think that boyer moore _is_ actually kicking in and providing more optimizations than I describe here. Also its interesting that slower2 is about the same speed as faster2.

use Benchmark 'cmpthese';
$match=0;
$string="";

cmpthese -5, {
slower => <<'EOFCODE',
for my $i (0..100) {
    $::string="${i}test";
    if($::string =~ /(?:est|\d)/){
        $::match++;
    }
}
EOFCODE

slower2 => <<'EOFCODE',
for my $i (0..100) {
    $::string="${i}test";
    if($::string =~ /(?:\d|est)/){
        $::match++;
    }
}
EOFCODE
faster => <<'EOFCODE',
for my $i (0..100) {
    $::string="${i}test";
    if ($::string =~ /est/ || $::string =~ /\d/){
        $::match++;
    }
}
EOFCODE
faster2 => <<'EOFCODE',
for my $i (0..100) {
    $::string="${i}test";
    if ($::string =~ /\d/ || $::string =~ /est/){
        $::match++;
    }
}
EOFCODE
};

__END__
Benchmark: running faster, faster2, slower, slower2, each for at least
+ 5 CPU seconds...
    faster:  5 wallclock secs ( 5.36 usr +  0.00 sys =  5.36 CPU) @ 31
+57.84/s (n=16926)
   faster2:  5 wallclock secs ( 5.39 usr +  0.00 sys =  5.39 CPU) @ 28
+59.21/s (n=15414)
    slower:  5 wallclock secs ( 5.44 usr +  0.00 sys =  5.44 CPU) @ 26
+73.59/s (n=14539)
   slower2:  6 wallclock secs ( 5.36 usr +  0.00 sys =  5.36 CPU) @ 28
+81.69/s (n=15443)
          Rate  slower faster2 slower2  faster
slower  2674/s      --     -6%     -7%    -15%
faster2 2859/s      7%      --     -1%     -9%
slower2 2882/s      8%      1%      --     -9%
faster  3158/s     18%     10%     10%      --
[download]

Although you should keep in mind that this benchmark is pretty misleading. It doesnt test against "live" data, nor does it include scenarios where there is no match, things like that. When you use some more "realistic" data you get the opposite results.

use Benchmark 'cmpthese';
$match=0;
@strings=qw(
            abcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyz
            abcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyz_est
            abcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyz_0
            est_abcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyz
            0_abcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyz
            abcdefghijklmnopqrstuvwxyzabcdefghijk_est_lmnopqrstuvwxyz
            abcdefghijklmnopqrstuvwxyzabcdefghijk_0_lmnopqrstuvwxyz
           );


cmpthese -5, {
slower => <<'EOFCODE',
for my $str (@::strings) {
    if($::string =~ /(?:est|\d)/){
        $::match++;
    }
}
EOFCODE

slower2 => <<'EOFCODE',
for my $str (@::strings) {
    if($::string =~ /(?:\d|est)/){
        $::match++;
    }
}
EOFCODE
faster => <<'EOFCODE',
for my $str (@::strings) {
    if ($::string =~ /est/ || $::string =~ /\d/){
        $::match++;
    }
}
EOFCODE
faster2 => <<'EOFCODE',
for my $str (@::strings) {
    if ($::string =~ /\d/ || $::string =~ /est/){
        $::match++;
    }
}
EOFCODE
};

__END__
[download]

Benchmark: running faster, faster2, slower, slower2, each for at least
+ 5 CPU seconds...
    faster:  4 wallclock secs ( 5.25 usr +  0.00 sys =  5.25 CPU) @ 10
+3837.90/s (n=545149)
   faster2:  6 wallclock secs ( 5.24 usr +  0.00 sys =  5.24 CPU) @ 10
+3939.83/s (n=544125)
    slower:  5 wallclock secs ( 5.25 usr +  0.00 sys =  5.25 CPU) @ 13
+3356.57/s (n=700122)
   slower2:  5 wallclock secs ( 5.28 usr +  0.00 sys =  5.28 CPU) @ 13
+2805.94/s (n=701481)
            Rate  faster faster2 slower2  slower
faster  103838/s      --     -0%    -22%    -22%
faster2 103940/s      0%      --    -22%    -22%
slower2 132806/s     28%     28%      --     -0%
slower  133357/s     28%     28%      0%      --
[download]

HTH

Yves / DeMerphq
---
Software Engineering is Programming when you can't. -- E. W. Dijkstra (RIP)

In reply to Re: RE performance by demerphq
in thread RE performance by Vennis

Posts are HTML formatted. Put <p> </p> tags around your paragraphs. Put <code> </code> tags around your code and data!

Titles consisting of a single word are discouraged, and in most cases are disallowed outright.

Read Where should I post X? if you're not absolutely sure you're posting in the right place.

Please read these before you post! —

Posts may use any of the Perl Monks Approved HTML tags:

a, abbr, b, big, blockquote, br, caption, center, col, colgroup, dd, del, details, div, dl, dt, em, font, h1, h2, h3, h4, h5, h6, hr, i, ins, li, ol, p, pre, readmore, small, span, spoiler, strike, strong, sub, summary, sup, table, tbody, td, tfoot, th, thead, tr, tt, u, ul, wbr

You may need to use entities for some characters, as follows. (Exception: Within code tags, you can put the characters literally.)

	For:		Use:
	&		`&`
	<		`<`
	>		`>`
	[		`[`
	]		`]`

Link using PerlMonks shortcuts! What shortcuts can I use for linking?

See Writeup Formatting Tips and other pages linked from there for more info.