in reply to Matching First Character of Strings Efficiently

All,
Here are the preliminary results:
$ perl benchmark.pl
             Rate hardburn_1 L~R_3 L~R_2 tachyon hardburn_2 delirium kvale  tye   L~R BrowserUk
hardburn_1 2.15/s         --   -8%  -16%    -17%       -18%     -19%  -22% -23%  -23%      -90%
L~R_3      2.35/s         9%    --   -8%     -9%       -11%     -12%  -15% -16%  -16%      -90%
L~R_2      2.56/s        19%    9%    --     -1%        -3%      -4%   -7%  -8%   -8%      -89%
tachyon    2.58/s        20%   10%    1%      --        -2%      -3%   -6%  -7%   -7%      -89%
hardburn_2 2.63/s        22%   12%    3%      2%         --      -1%   -5%  -6%   -6%      -88%
delirium   2.67/s        24%   14%    4%      3%         1%       --   -3%  -4%   -4%      -88%
kvale      2.76/s        28%   17%    8%      7%         5%       3%    --  -1%   -1%      -88%
tye        2.78/s        29%   19%    9%      8%         6%       4%    1%   --   -0%      -88%
L~R        2.79/s        30%   19%    9%      8%         6%       5%    1%   0%    --      -88%
BrowserUk  22.5/s       944%  857%  778%    770%       755%     743%  715% 708%  705%        --
Here is the script I used to build the list:
#!/usr/bin/perl
use strict;
use warnings;

open (LIST, '>', "list.rnd") or die "Unable to open list.rnd : $!";

for ( 1 .. 10_000 ) {
    my $length = int(rand 240) + 10;
    my $string = "";
    $string .= ('a' .. 'z')[ rand 26 ] while length $string < $length;
    print LIST "$string\n";
}
And finally, here is the benchmark itself
#!/usr/bin/perl
use strict;
use warnings;
use Benchmark 'cmpthese';
use Inline C;

my @list;
open (LIST, '<', 'list.rnd') or die "Unable to open list.rnd for reading : $!";
while ( <LIST> ) {
    chomp;
    push @list, $_;
}

my $length = int(rand 240) + 10;
my $str_a = "";
$str_a .= ('a' .. 'z')[ rand 26 ] while length $str_a < $length;

sub expensive_function {
    my ($str_a, $str_b) = @_;
    my $foo;
    for ( split //, $str_a . $str_b ) {
        $foo++;
    }
}

cmpthese -5, {
    'tachyon' => sub {
        for my $str_b ( @list ) {
            next if ! same_scan( $str_a, $str_b );
            expensive_function( $str_a, $str_b );
        }
    },
    'L~R' => sub {
        for my $str_b ( @list ) {
            next if index($str_a, substr($str_b, 0, 1));
            expensive_function( $str_a, $str_b );
        }
    },
    'hardburn_1' => sub {
        my ($a_1st, $rest) = split '', $str_a, 2;
        for my $str_b ( @list ) {
            my ($b_1st, $rest) = split '', $str_b, 2;
            next if $b_1st ne $a_1st;
            expensive_function( $str_a, $str_b );
        }
    },
    'hardburn_2' => sub {
        my $rev_a = reverse $str_a;
        my $a_1st = chop $rev_a;
        for my $str_b ( @list ) {
            my $rev_b = reverse $str_b;
            my $b_1st = chop $rev_b;
            next if $b_1st ne $a_1st;
            expensive_function( $str_a, $str_b );
        }
    },
    'delirium' => sub {
        my $fc = substr($str_a, 0, 1);
        for my $str_b ( @list ) {
            next if $str_b !~ /^$fc/;
            expensive_function( $str_a, $str_b );
        }
    },
    'BrowserUk' => sub {
        my %list;
        @list{ map { substr $_, 0, 1 } @list } = ();
        my $fc = substr $str_a, 0, 1;
        for my $str_b ( @list ) {
            next if exists $list{ $fc };
            expensive_function( $str_a, $str_b );
        }
    },
    'tye' => sub {
        for my $str_b ( @list ) {
            next if ord( $str_a ) != ord( $str_b );
            expensive_function( $str_a, $str_b );
        }
    },
    'kvale' => sub {
        my $fc = substr $str_a, 0, 1;
        for my $str_b ( @list ) {
            next if $fc ne substr $str_b, 0, 1;
            expensive_function( $str_a, $str_b );
        }
    },
    'L~R_2' => sub {
        my $fc = unpack "a1", $str_a;
        for my $str_b ( @list ) {
            next if $fc ne unpack "a1", $str_b;
            expensive_function( $str_a, $str_b );
        }
    },
    'L~R_3' => sub {
        my ($fc_a) = $str_a =~ /^(.)/;
        for my $str_b ( @list ) {
            my ($fc_b) = $str_b =~ /^(.)/;
            next if $fc_a ne $fc_b;
            expensive_function( $str_a, $str_b );
        }
    },
};

__END__
__C__
int same_scan(char* str1, char* str2) {
    return str1[0] == str2[0] ? 1 : 0;
}
Note: I have not had a chance to verify that each one of these tests is calling expensive_function the same number of times so please check back if you are interested. I need to head home now.

Cheers - L~R

Update: See updated benchmark.

Replies are listed 'Best First'.
Re^2: Matching First Character of Strings Efficiently (benchmarking)
by tye (Sage) on Mar 15, 2004 at 22:01 UTC

    Including 'expensive_function' in a benchmark means that you are mostly measuring how often that function gets called (which isn't what you asked for) and reduces the differences in the remaining items to the level of noise (which makes for a worse benchmark).

    I commented out the guts of expensive_function, shortened the names, added tye2 which does what I thought was the obvious optimization of pulling out the first char of $str_a at the top:

    'tye2' => sub {
        my $f = ord($str_a);
        for my $str_b ( @list ) {
            next if $f != ord( $str_b );
            expensive_function( $str_a, $str_b );
        }
    },
    and got these results:
            Rate   hb1   hb2  L~R3   BUk  L~R2  dlrm   L~R  tye1 kvale  tye2
    hb1   46.5/s    --  -35%  -35%  -38%  -60%  -62%  -70%  -73%  -73%  -74%
    hb2   71.4/s   53%    --   -0%   -6%  -38%  -42%  -54%  -59%  -59%  -61%
    L~R3  71.4/s   53%    0%    --   -6%  -38%  -42%  -54%  -59%  -59%  -61%
    BUk   75.7/s   63%    6%    6%    --  -35%  -39%  -51%  -56%  -56%  -58%
    L~R2   116/s  149%   62%   62%   53%    --   -6%  -25%  -33%  -33%  -36%
    dlrm   124/s  166%   73%   73%   63%    7%    --  -20%  -28%  -29%  -32%
    L~R    155/s  233%  117%  117%  105%   34%   25%    --  -10%  -11%  -14%
    tye1   172/s  270%  141%  141%  128%   48%   39%   11%    --   -1%   -5%
    kvale  174/s  274%  143%  143%  130%   50%   41%   12%    1%    --   -4%
    tye2   181/s  289%  154%  154%  139%   56%   46%   17%    5%    4%    --

    But I still consider such nano-optimization games to be more a waste than of value. (:

    - tye        

      tye,
      Calculating the ordinal value of $str_a only once was an obvious optimization. I was hurrying to get out of the office - that will teach me.

      But I still consider such nano-optimization games to be more a waste than of value.

      I completely disagree with you. You don't seem to condemn golfing, which I feel is more of a waste (probably because I am no good at it). It is not like I was complaining that my solution wasn't fast enough. I feel that such games are of value for a few reasons:
      • It can be fun
      • It can help disprove myths
      • Figuring out the most efficient way once allows it to be reused in the future effortlessly.
      Now I am not saying one should sacrifice readability in order to save a few cycles, or that a person should spend more programmer time than they will save in run time. As a periodic academic exercise, though, I see it as having a lot of value. Of course, YMMV since you know a lot more than I do.

      Cheers - L~R

        Perl Golfing is not an unmixed blessing. But most people are smart enough to realize that golfed solutions aren't usually the best choice when writing code that you care about.

        Unfortunately, many people seem to think that nano-optimized code is the best choice when writing code that you care about. You reinforce this yourself above.

        So we get people spending way too much time thinking and arguing about whether they should use substr or ord, then spreading the joy by nano-optimizing other things, and in the end they spend an extra 4 hours on their code so that it runs 0.002 seconds faster (out of a total run time of 2.4 seconds).

        It'd be much better if people chose the code that was clear or easy to maintain.

        Just look at how much time has been wasted by people trying to replace s/x//g with tr.
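        For reference, the two spellings being argued over really do produce the same result for a single-character delete; a minimal sketch (tr///d skips the regex engine, which is the whole basis of the argument):

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $s1 = "exxampxle";
my $s2 = "exxampxle";

$s1 =~ s/x//g;     # regex substitution: delete every 'x'
$s2 =~ tr/x//d;    # transliteration with delete: same effect

print "$s1 $s2\n"; # prints "eample eample"
```

Either way the time saved is tiny next to whatever the surrounding program does, which is tye's point.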

        The cult of nano-optimization is a seductive one. This is why several people have a very strong aversion to micro-/nano-optimization.

        - tye        

      Rate ... hb1 46.5/s ... ... tye2 181/s ...
      But I still consider such nano-optimization games to be more a waste than of value.

      Since when is a 4-to-1 speed-up "nano"? Granted, in the OP's case, maybe this particular 4:1 ratio would end up affecting only 10% of the overall run-time, once "expensive_process" (whatever it is) plays its role. Also, I wouldn't really want to bust my butt running an exhaustive test of 12 alternatives -- just compare the first (or current) idea against the next idea that looks like it could do better, in case the current idea seems to be taking longer than I like.

      But given that this sort of vetting task is likely to come up, if I had to do it over an input list of, say, 2 million items, I'd much prefer that it take 3 hours rather than 12.

      update: Having seen the replies to my post from tye and Aristotle, I have to say (sheepishly) thanks for the wake-up -- I should have known better than to take this sort of benchmark tabulation at "face value".

      In fact, I have used Benchmark myself on other problems -- but I opted for the "timethese" method (not "cmpthese"), which offers the advantage of showing the overall wall-clock difference as well as the "relative proportional" time difference for each approach. In one attempt, a comparison of two approaches showed "5.44/s" for one and "94.34/s" for the other -- like wow, huh? -- but the wall-clock difference was 2 sec vs. 9 sec... go figure.

      If an app really needs to be optimized, the first tool of choice should be profiling, of course, to see what segment of the code is eating all the time, so you can rethink that part first.
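      As an aside, you don't have to choose between the two: Benchmark's cmpthese will accept the hashref that timethese returns, so one run can give both the wall-clock view and the relative-percentage table. A minimal sketch (the two subs compared are only illustrative):

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Benchmark qw(timethese cmpthese);

# timethese prints wall-clock and CPU time for each entry and returns
# the raw results; cmpthese then adds the relative-percentage table.
my $results = timethese( -1, {
    'substr' => sub { my $c = substr "abcdef", 0, 1 },
    'ord'    => sub { my $c = ord "abcdef" },
} );
cmpthese( $results );
```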

        Since when is a 4-to-1 speed-up "nano"?

        It is called micro- or nano-optimization because the operation being optimized is so small that it is likely to account for closer to 0.1% or less of the total run time (instead of the 10% you guessed) of most non-trivial scripts, not because Benchmark.pm measures a small percentage difference.

        Even if it accounted for 10% of the run time (extremely unlikely for an operation as small as comparing two characters), a 4-to-1 speed-up means only a 7.5% total speed-up. In reality, you are more likely to get something like a 0.075% speed up.
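        The arithmetic here is just Amdahl's-law bookkeeping; a quick sketch of the calculation (the 10% share is the parent post's guess, not a measured number):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# If the optimized operation is 10% of the run time and gets 4x faster,
# the new total is 90% + 10%/4 = 92.5% of the old -- a 7.5% speed-up.
my $fraction = 0.10;   # share of run time spent in the optimized code
my $speedup  = 4;      # how much faster that code gets
my $new_total = (1 - $fraction) + $fraction / $speedup;

printf "%.1f%% of the old run time (%.1f%% saved)\n",
    100 * $new_total, 100 * (1 - $new_total);
# A 12-hour run would therefore take 12 * 0.925 = 11.1 hours.
```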

        I'd much prefer that it take 3 hours rather than 12

        See, this is why this cult is so seductive. No matter how many times someone explains why this type of stuff is almost never very useful, people see "Four times faster!" and start drooling. Using your own (highly optimistic) numbers, a 12-hour run would take 11.1 hours (not 3). I mention this not because I think 0.9 hours is a short time (because the reality will probably be less than even a few minutes saved for a 12-hour run) but to show how "Four times faster!" takes over one's thoughts and one jumps to invalid conclusions.

        Even if you had a script that did (well, tried to do) nothing but compare lots of leading characters of strings, you'd probably find that the overhead would end up adding up to a lot more than the comparison time. Benchmark.pm does a lot of work to estimate the overhead and subtract it.

        I also think no one would try the slowest methods benchmarked here except in a rather desperate attempt to nano-optimize, so even the 4-to-1 speed-up isn't real. It is a nearly 4-to-1 slow-down caused by a desperate optimization attempt.

        - tye        

        Very well, now look at the other results. The solution both I and kvale proposed is the most straightforward, arguably most maintainable, and least "creative" one. It comes in third, much faster than the slowest, very "clever" solution, and achieves over 80% of the speed of tye's winning entry. Do you see anything wrong with your picture yet?

        The reality is that "boring" solutions chosen with a clear understanding of the task at hand are usually plenty fast enough. In my experience, they tend to achieve close to 85% of the possible performance.

        Given the points already made by tye about the reality of the impact of microoptimizations, there is very little worth in evaluating the obscure. Go with the maintainable, clear, concise solution and worry only if performance is actually insufficient.

        And even if you have gotten there, it is still more worthwhile to optimize your expensive_functions instead.

        Makeshifts last the longest.

Re: Re: Matching First Character of Strings Efficiently
by Roy Johnson (Monsignor) on Mar 15, 2004 at 21:45 UTC
    It looks like the "BrowserUk" solution will never call expensive_function as intended. If $fc ever shows up as the first character of any string in @list, that exists test will always yield true; if not, it will call expensive_function every time through.
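    One way the bucket idea could be repaired -- this is my guess at the intent, not a fix verified against BrowserUk's original -- is to group the list by first character once and then visit only the matching bucket. The data and the stand-in expensive_function below are hypothetical, just to make the sketch self-contained:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical inputs and a call-counting stand-in for expensive_function.
my @list  = qw(apple ant bear cat cobra);
my $str_a = "canary";
my $calls = 0;
sub expensive_function { $calls++ }

# Bucket the list by first character once...
my %by_first;
push @{ $by_first{ substr $_, 0, 1 } }, $_ for @list;

# ...then only strings sharing $str_a's first character are visited.
my $fc = substr $str_a, 0, 1;
expensive_function( $str_a, $_ ) for @{ $by_first{$fc} || [] };

print "$calls\n";   # prints 2 ("cat" and "cobra" match)
```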

    The PerlMonk tr/// Advocate
Re: Re: Matching First Character of Strings Efficiently
by tachyon (Chancellor) on Mar 16, 2004 at 04:09 UTC

    Just looking at the faster versions and removing everything that is not the function under test (i.e. one true test and one false test make a set)...

    #!/usr/bin/perl
    use Inline C;
    use Benchmark 'cmpthese';

    #            Rate  tchn tchn2  kval    LR  tye2  tye1
    # tchn   556060/s    --   -9%  -19%  -20%  -50%  -64%
    # tchn2  611783/s   10%    --  -11%  -12%  -45%  -60%
    # kval   690724/s   24%   13%    --   -0%  -38%  -55%
    # LR     691904/s   24%   13%    0%    --  -38%  -55%
    # tye2  1114446/s  100%   82%   61%   61%    --  -28%
    # tye1  1542899/s  177%  152%  123%  123%   38%    --

    my $str_a = 'a' x 255;
    my $str_b = 'b' x 255;
    my $str_c = $str_a;
    my ( $fc );

    cmpthese -5, {
        'tchn' => sub {
            $a = same_scan( $str_a, $str_b );
            $a = same_scan( $str_a, $str_c );
        },
        'tchn2' => sub {
            $a = same_scan2( $str_a, $str_b );
            $a = same_scan2( $str_a, $str_c );
        },
        'LR' => sub {
            $a = index($str_a, substr($str_b, 0, 1));
            $a = index($str_a, substr($str_b, 0, 1));
        },
        'tye1' => sub {
            $a = ord $str_a != ord $str_b;
            $a = ord $str_a != ord $str_c;
        },
        'tye2' => sub {
            $fc = ord $str_a;
            $a = $fc != ord( $str_b );
            $fc = ord $str_a;
            $a = $fc != ord( $str_b );
        },
        'kval' => sub {
            $fc = substr $str_a, 0, 1;
            $a = $fc ne substr $str_b, 0, 1;
            $fc = substr $str_a, 0, 1;
            $a = $fc ne substr $str_c, 0, 1;
        },
    };

    __END__
    __C__
    int same_scan(char* str1, char* str2) {
        return str1[0] == str2[0] ? 1 : 0;
    }
    int same_scan2(SV* str1, SV* str2) {
        return SvPV(str1, PL_na)[0] == SvPV(str2, PL_na)[0] ? 1 : 0;
    }

    cheers

    tachyon

Re: Re: Matching First Character of Strings Efficiently
by Limbic~Region (Chancellor) on Mar 15, 2004 at 22:47 UTC
    tye mentioned in the CB that he suspected one of the entries was broken and expensive_function wasn't being called. I knew immediately he was right, but didn't have a chance to figure out which one before heading home. Of course I worked it out on the way home, but by then Roy Johnson had already pointed it out. I took some of tye's comments and produced the following results:

    Cheers - L~R
Re: Re: Matching First Character of Strings Efficiently
by kral (Monk) on Mar 16, 2004 at 14:53 UTC
    In my previous message, I proposed a solution that was perhaps not considered because it was in a test context. As a newbie in Perl, I would like to know whether my solution is good or not.
    my $fc = substr($str_a, 0, 1);
    for my $str_b ( @list ) {
        next if ord($fc) == ord((split(//, $str_b))[0]);
        expensive_function( $str_a, $str_b );
    }
    Tnx to all. :^)
    ----------
    kral
    (sorry for my english)
      kral,
      Actually, I just missed it when I was putting the benchmark together. I would give you the award for the most unnecessary steps, though. Let me explain why.
      • You use substr to get the first character even though ord already returns the numeric value of a string's first character
      • You use split to get a list of individual characters and then throw all but the first away with a slice
      • You then use ord a second time for the test
      There is, however, nothing wrong with it.

      Cheers - L~R

        You use substr to get 1st char even though ord returns numeric value of 1st char
        You use split to get a list of individual characters and throw them away using a slice
        I don't want to seem obstinate (I'm here for learning), but I made some tests:
        #!/usr/bin/perl -w
        use strict;
        use Benchmark qw(:all);

        my $str_a = "dlajsdlkajslkdjasldjasljdaskjd";

        cmpthese( -5, {
            'Test1' => sub { ord( $str_a ) },
            'Test2' => sub { ord( ( split( //, $str_a ) )[0] ) },
            'Test3' => sub { ord( substr $str_a, 0, 1 ) },
        } );

        __DATA__
        Results:
                    Rate   Test2  Test3  Test1
        Test1  1827646/s  28815%   254%     --
        Test3   516114/s   8065%     --   -72%
        Test2     6321/s      --   -99%  -100%
        I concede that it isn't very readable, but it seems the ord function works better with only one character as its argument.
        Update: I have misunderstood the result, sorry!
        ----------
        kral
        (sorry for my english)