Re^5: "advanced" Perl functions and maintainability

I've done the benchmarks before. Here:

#!/usr/bin/perl -w
use strict;
use Benchmark 'cmpthese';

my $string       = "this is a string" x 300;
my $short_string = "this is a short string";

cmpthese(10000000, {
    'index' => sub {
        my $res;

        $res = index($string, "this", 0);
        $res = index($string, "string");
        $res = rindex($string, "string");
    },
    regex   => sub {
        my $res;

        $res = $string =~ /^this/;
        $res = $string =~ /string/;
        $res = $string =~ /string$/;
    },
    shortindex => sub {
        my $res;

        $res = index($short_string, "this", 0);
        $res = index($short_string, "short");
        $res = rindex($short_string, "string");
    },
    shortregex => sub {
        my $res;

        $res = $short_string =~ /^this/;
        $res = $short_string =~ /short/;
        $res = $short_string =~ /string$/;
    },
    substrindex => sub {
        my $res;

        my $substr = "this is";
        $res = index($short_string, $substr);
    },
    substrregex => sub {
        my $res;

        my $substr = "this is";
        $res = $short_string =~ /\Q$substr\E/;
    }
});

Benchmark: timing 10000000 iterations of index, regex, shortindex, shortregex, substrindex, substrregex...
     index: 18 wallclock secs (17.29 usr +  0.00 sys = 17.29 CPU) @ 578369.00/s (n=10000000)
     regex: 32 wallclock secs (32.52 usr +  0.00 sys = 32.52 CPU) @ 307503.08/s (n=10000000)
shortindex: 15 wallclock secs (15.15 usr +  0.00 sys = 15.15 CPU) @ 660066.01/s (n=10000000)
shortregex: 26 wallclock secs (27.14 usr +  0.00 sys = 27.14 CPU) @ 368459.84/s (n=10000000)
substrindex:  9 wallclock secs ( 9.39 usr +  0.00 sys =  9.39 CPU) @ 1064962.73/s (n=10000000)
substrregex: 17 wallclock secs (16.64 usr +  0.00 sys = 16.64 CPU) @ 600961.54/s (n=10000000)
                 Rate  regex shortregex index substrregex shortindex substrindex
regex        307503/s     --       -17%  -47%        -49%       -53%        -71%
shortregex   368460/s    20%         --  -36%        -39%       -44%        -65%
index        578369/s    88%        57%    --         -4%       -12%        -46%
substrregex  600962/s    95%        63%    4%          --        -9%        -44%
shortindex   660066/s   115%        79%   14%         10%         --        -38%
substrindex 1064963/s   246%       189%   84%         77%        61%          --

The regex is almost always slower, but usually not by that much.

janitored by ybiC: Replaced almost-always-inappropriate <pre> tags around benchmark results with <code> tags, to avoid annoying lateral scrolling


Replies are listed 'Best First'.
Re^6: "advanced" Perl functions and maintainability by Anonymous Monk on Dec 13, 2004 at 14:13 UTC
Quite inconclusive. Considering what the optimizer does, it depends heavily on your data whether index() or a regex is faster. It also depends on whether there is a match, where the match is (if any), and whether the string has been studied. Here's some more data:

#!/usr/bin/perl
use strict;
use warnings;
use Benchmark 'cmpthese';

our $string  = "abcd" x 1000;
    $string .= "e";
    $string .= "abcd" x 1000;
our $study   = $string;
our $pass    = "abcde";
our $fail1   = "foo12";
our $fail2   = "abdce";
study $study;

cmpthese(-1, {
    index_pass => 'index($string, $pass)',
    regex_pass => '$string =~ /$pass/',
    study_pass => '$study  =~ /$pass/',
});

print ("\n\n");

cmpthese(-1, {
    index_fail1 => 'index($string, $fail1)',
    index_fail2 => 'index($string, $fail2)',
    regex_fail1 => '$string =~ /$fail1/',
    regex_fail2 => '$string =~ /$fail2/',
    study_fail1 => '$study  =~ /$fail1/',
    study_fail2 => '$study  =~ /$fail2/',
});

__END__
               Rate index_pass study_pass regex_pass
index_pass  38331/s         --        -6%       -69%
study_pass  40960/s         7%         --       -67%
regex_pass 125463/s       227%       206%         --

                 Rate index_fail2 study_fail2 index_fail1 regex_fail2 regex_fail1 study_fail1
index_fail2   27306/s          --         -0%        -48%        -56%        -64%        -99%
study_fail2   27307/s          0%          --        -48%        -56%        -64%        -99%
index_fail1   52608/s         93%         93%          --        -15%        -31%        -98%
regex_fail2   61837/s        126%        126%         18%          --        -19%        -98%
regex_fail1   75918/s        178%        178%         44%         23%          --        -98%
study_fail1 3412032/s      12396%      12395%       6386%       5418%       4394%          --

Note that with this data, index is slower than a regex. The speed of 'study_fail1' is explained by the fact that the string we are looking for, 'foo12', contains letters not present in the string. Since a studied string has a histogram of its letter frequencies attached, no searching needs to be performed at all.
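Since the result is said to depend on whether and where the match occurs, it can help to confirm that before reading the timings. The following is a minimal sketch, not part of the original post, that reuses the test data above and reports the offset each approach finds. The helper name `find_both` is hypothetical; `@-` is Perl's built-in array of match start offsets, and `\Q...\E` is used so the regex is a literal search like `index`.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Same haystack as in the benchmark above: 8001 characters, one "e" in the middle.
my $string = "abcd" x 1000;
$string .= "e";
$string .= "abcd" x 1000;

# Hypothetical helper: return the offset found by index() and the offset
# found by an equivalent literal regex (-1 when there is no match).
sub find_both {
    my ($haystack, $needle) = @_;
    my $by_index = index($haystack, $needle);
    my $by_regex = $haystack =~ /\Q$needle\E/ ? $-[0] : -1;
    return ($by_index, $by_regex);
}

for my $needle (qw(abcde foo12 abdce)) {
    my ($idx, $re) = find_both($string, $needle);
    printf "%-5s index=%d regex=%d\n", $needle, $idx, $re;
}
```

With this data the only match for "abcde" starts at offset 3996, roughly the middle of the string, and the two failing needles yield -1 from both approaches.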
Re^7: "advanced" Perl functions and maintainability by William G. Davis (Friar) on Dec 13, 2004 at 15:11 UTC
Right, except no one ever uses study, and most Perl hackers still pull out m// even if they only need to search for one or two constant substrings. Also, you didn't quote meta characters in the pattern strings.
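To illustrate the general point about metacharacters (this is a sketch with made-up sample strings, not code from the thread): `index` always treats its argument literally, while a variable interpolated into a pattern is interpreted as a regex unless it is wrapped in `\Q...\E` or passed through `quotemeta`.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Made-up sample data for illustration only.
my $needle = "3.50";                 # "." is a regex metacharacter
my $text   = "price: 3.50 (approx)"; # contains the literal needle
my $other  = "price: 3x50";          # does not contain it

# index() searches for the literal characters "3.50".
my $idx_hit  = index($text,  $needle) >= 0 ? 1 : 0;
my $idx_miss = index($other, $needle) >= 0 ? 1 : 0;

# Interpolated without quoting, "." matches any character, including "x".
my $raw_match = $other =~ /$needle/ ? 1 : 0;

# With \Q...\E the pattern is literal again and agrees with index().
my $quoted_match = $other =~ /\Q$needle\E/ ? 1 : 0;

print "index on text:      $idx_hit\n";
print "index on other:     $idx_miss\n";
print "unquoted regex:     $raw_match\n";
print "quoted regex (\\Q):  $quoted_match\n";
```

A benchmark that compares `index` with `m//` is only comparing like with like when the needle contains no metacharacters or the pattern is quoted this way.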
Re^8: "advanced" Perl functions and maintainability by itub (Priest) on Dec 13, 2004 at 16:37 UTC
Even without using `study`, the speed advantage of `index` is almost lost when the match is at the end of a long string.

my $string = "this is a string" x 300 . "xyz";

cmpthese(500000, {
    'index' => sub {
        my $res;
        $res = index($string, "xyz");
    },
    regex   => sub {
        my $res;
        $res = $string =~ /xyz/;
    },
});

         Rate regex index
regex 97087/s    --   -2%
index 98619/s    2%    --

> most Perl hackers still pull out m//...

Ok, I agree that `index` is faster in some cases. However, I think there are good reasons for the behavior of most Perl hackers:

- m// "scales" better in terms of uses. You can use it for the simplest things as well as for very complex things. It is practical and idiomatic. To me, that sounds like a description of Perl itself.
- If you use it for your constant string to begin with and then you decide you need metacharacters, the change is smaller. Ok, this is a minor advantage, as the change wouldn't be that big anyway.
- If you are using regular expressions elsewhere in the code, the code looks more consistent, and that makes it more readable.
- Worrying about the speed of `index` vs `m//` may be premature optimization. If you wanted the fastest possible solution you might not want to use Perl in the first place. Even if you want the fastest possible Perl implementation, it is always better to make the code correct and readable first, and optimize the hot spots later.

I realize that some of these reasons (particularly the last one) agree with your argument for using `for` and `push` instead of `map`. I just wasn't sure that your assertion regarding `index` was correct, as my benchmarks had shown the opposite in the past. Now I see that it depends on the situation.

Regarding the readability of `map` vs `for`, I would say that a distinct advantage of `map` is that it documents the purpose of the loop right at the top (when used properly). As soon as you see the `map` keyword you'll know that you are building a list and you'll know where it is being stored; with `for`, you have to wait until you see the `push` to see the true purpose of the loop. Which approach is better depends on the intention of the coder, and I agree that the size of the block may be a factor to consider. The problem is that the exact line between `map` and `for` is blurry, partly a matter of style and personal preference.
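The `map` versus `for`/`push` point can be seen side by side in a small sketch. The word list and variable names are made up for illustration; both forms build the same list.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Made-up input for illustration.
my @words = qw(map for push);

# map: the destination and the fact that a list is being built
# are both visible on the first line.
my @lengths_map = map { length $_ } @words;

# for + push: the loop's purpose only becomes clear at the push.
my @lengths_for;
for my $word (@words) {
    push @lengths_for, length $word;
}

print "map: @lengths_map\n";
print "for: @lengths_for\n";
```

With a one-expression body like this the `map` form is arguably easier to scan; as the block grows, the explicit loop may read better, which is the blurry line described above.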
Re^8: "advanced" Perl functions and maintainability by Anonymous Monk on Dec 13, 2004 at 15:43 UTC
Well, you were the one claiming they shouldn't use m// because that's more work for perl than using index, posting a benchmark to back up your claim. I posted a benchmark using different data which shows index losing. index isn't always faster, so Perl hackers aren't "wrong" or even inefficient for using m// over index.

> Also, you didn't quote meta characters in the pattern strings.

Yes I did. If you think I'm wrong, please point out an unquoted meta character in one of the pattern strings. If there were unquoted meta characters in the pattern strings, the benchmark wouldn't be fair, would it? index doesn't know metacharacters.
Re^9: "advanced" Perl functions and maintainability by William G. Davis (Friar) on Dec 13, 2004 at 16:19 UTC
Re^10: "advanced" Perl functions and maintainability by diotalevi (Canon) on Dec 13, 2004 at 21:27 UTC
Re^6: "advanced" Perl functions and maintainability by Aristotle (Chancellor) on Dec 20, 2004 at 03:51 UTC
Your benchmark seems quite confused and, well, anything but useful.

Your distinction between a short and long string is useless and deceptive because there are matches near both ends of the test strings for all of the searches you run, and there are no non-match data sets at all.

`index( $foo, $bar, 0 )` is no different from `index( $foo, $bar )`, neither of which does the same as `/^$bar/`, just like `rindex( $foo, $bar )` is something entirely different from `/$bar$/`. You are comparing apples and meteors.

Note that putting multiple different benchmarks in a single table only serves the purpose of casting further confusion onto the data. You are benchmarking three things; run three benchmarks, look at three tables.

`use re 'debug';` and compile a few regexen sometime; you'll see that the regex engine turns the trivial cases into pretty much a plain index. There's more fixed overhead for invoking the engine rather than just calling that function, of course, but on large data sets that's negligible.

Makeshifts last the longest.
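The semantic differences listed above can be checked directly. This is a minimal sketch with a made-up sample string; the `substr` comparisons are one possible non-regex way to express "starts with" and "ends with", not something proposed in the thread. It uses `\z` (true end of string) rather than `$`, which would also allow a trailing newline.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Made-up sample data: $bar occurs at offsets 0 and 22, but not at the end.
my $foo = "this is a string, and this is too";
my $bar = "this";

# A start position of 0 changes nothing: both find the first occurrence anywhere.
my $first          = index($foo, $bar);
my $first_explicit = index($foo, $bar, 0);

# rindex() finds the last occurrence anywhere, not "a match at the end".
my $last = rindex($foo, $bar);

# Anchored regexes ask different questions: starts with / ends with.
my $starts_re = $foo =~ /^\Q$bar\E/  ? 1 : 0;
my $ends_re   = $foo =~ /\Q$bar\E\z/ ? 1 : 0;

# Possible non-regex equivalents of the anchored forms.
my $starts_sub = substr($foo, 0, length $bar) eq $bar ? 1 : 0;
my $ends_sub   = substr($foo, -length($bar))  eq $bar ? 1 : 0;

print "index: $first, index with 0: $first_explicit, rindex: $last\n";
print "starts with: regex=$starts_re substr=$starts_sub\n";
print "ends with:   regex=$ends_re substr=$ends_sub\n";
```

Here `rindex` reports a match at offset 22 even though the string does not end with `$bar`, so a benchmark pairing `rindex` with an end-anchored pattern is timing two different operations. To look at what the engine compiles for a simple pattern, something like `perl -Mre=debug -e '"abc" =~ /b/'` can be used.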