Right, but "$string =~ /$substring/" requires more work on the part of the interpreter than plain "index($string, $substring)" does, yet only a minority of Perl hackers bother to touch index/rindex at all.
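For readers who rarely use index, here is a minimal sketch of the two spellings of a plain substring test. The sample strings are made up for illustration and are not from the thread. It also shows why the regex form needs \Q...\E when the substring is meant literally.

```perl
use strict;
use warnings;

# Hypothetical sample data, chosen so the substring contains a regex metacharacter.
my $string    = "price: 3.50 (approx.)";
my $substring = "3.50";

# index() returns the offset of the first occurrence, or -1 if there is none.
my $via_index = index($string, $substring) >= 0 ? 1 : 0;

# The regex form must quote the substring, otherwise "." matches any character.
my $via_regex = $string =~ /\Q$substring\E/ ? 1 : 0;

# Without \Q the pattern is looser than the literal search:
my $loose_regex  = "3x50" =~ /$substring/ ? 1 : 0;          # "." matches "x"
my $strict_index = index("3x50", $substring) >= 0 ? 1 : 0;  # literal search fails

print "index: $via_index, regex: $via_regex\n";
print "unquoted regex on '3x50': $loose_regex, index on '3x50': $strict_index\n";
```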

Re^4: "advanced" Perl functions and maintainability
by itub (Priest) on Dec 13, 2004 at 05:18 UTC
    Does it, really? As far as I can tell, the optimizations in the regular expression engine nicely cover the kinds of matches that could be done with index. Try running some benchmarks and see.

      I've done the benchmarks before. Here:

      #!/usr/bin/perl -w
      use strict;
      use Benchmark 'cmpthese';

      my $string       = "this is a string" x 300;
      my $short_string = "this is a short string";

      cmpthese(10000000, {
          'index' => sub {
              my $res;
              $res = index($string, "this", 0);
              $res = index($string, "string");
              $res = rindex($string, "string");
          },
          regex => sub {
              my $res;
              $res = $string =~ /^this/;
              $res = $string =~ /string/;
              $res = $string =~ /string$/;
          },
          shortindex => sub {
              my $res;
              $res = index($short_string, "this", 0);
              $res = index($short_string, "short");
              $res = rindex($short_string, "string");
          },
          shortregex => sub {
              my $res;
              $res = $short_string =~ /^this/;
              $res = $short_string =~ /short/;
              $res = $short_string =~ /string$/;
          },
          substrindex => sub {
              my $res;
              my $substr = "this is";
              $res = index($short_string, $substr);
          },
          substrregex => sub {
              my $res;
              my $substr = "this is";
              $res = $short_string =~ /\Q$substr\E/;
          }
      });

       

      Benchmark: timing 10000000 iterations of index, regex, shortindex, shortregex, substrindex, substrregex...
            index: 18 wallclock secs (17.29 usr + 0.00 sys = 17.29 CPU) @ 578369.00/s (n=10000000)
            regex: 32 wallclock secs (32.52 usr + 0.00 sys = 32.52 CPU) @ 307503.08/s (n=10000000)
       shortindex: 15 wallclock secs (15.15 usr + 0.00 sys = 15.15 CPU) @ 660066.01/s (n=10000000)
       shortregex: 26 wallclock secs (27.14 usr + 0.00 sys = 27.14 CPU) @ 368459.84/s (n=10000000)
      substrindex:  9 wallclock secs ( 9.39 usr + 0.00 sys =  9.39 CPU) @ 1064962.73/s (n=10000000)
      substrregex: 17 wallclock secs (16.64 usr + 0.00 sys = 16.64 CPU) @ 600961.54/s (n=10000000)

                       Rate  regex shortregex  index substrregex shortindex substrindex
      regex        307503/s     --       -17%   -47%        -49%       -53%        -71%
      shortregex   368460/s    20%         --   -36%        -39%       -44%        -65%
      index        578369/s    88%        57%     --         -4%       -12%        -46%
      substrregex  600962/s    95%        63%     4%          --        -9%        -44%
      shortindex   660066/s   115%        79%    14%         10%         --        -38%
      substrindex 1064963/s   246%       189%    84%         77%        61%          --

      The regex is almost always slower, but usually not by that much.

      janitored by ybiC: Replaced almost-always-inappropriate <pre> tags around benchmark results with <code> tags, to avoid annoying lateral scrolling

        Quite inconclusive. Considering what the optimizer does, whether index() or a regex is faster depends heavily on your data. It also depends on whether there is a match, where the match is (if any), and whether the string has been studied. Here's some more data:
        #!/usr/bin/perl
        use strict;
        use warnings;
        use Benchmark 'cmpthese';

        our $string = "abcd" x 1000;
        $string .= "e";
        $string .= "abcd" x 1000;
        our $study = $string;
        our $pass  = "abcde";
        our $fail1 = "foo12";
        our $fail2 = "abdce";
        study $study;

        cmpthese(-1, {
            index_pass => 'index($string, $pass)',
            regex_pass => '$string =~ /$pass/',
            study_pass => '$study =~ /$pass/',
        });
        print ("\n\n");
        cmpthese(-1, {
            index_fail1 => 'index($string, $fail1)',
            index_fail2 => 'index($string, $fail2)',
            regex_fail1 => '$string =~ /$fail1/',
            regex_fail2 => '$string =~ /$fail2/',
            study_fail1 => '$study =~ /$fail1/',
            study_fail2 => '$study =~ /$fail2/',
        });
        __END__
                       Rate index_pass study_pass regex_pass
        index_pass  38331/s         --        -6%       -69%
        study_pass  40960/s         7%         --       -67%
        regex_pass 125463/s       227%       206%         --


                         Rate index_fail2 study_fail2 index_fail1 regex_fail2 regex_fail1 study_fail1
        index_fail2   27306/s          --         -0%        -48%        -56%        -64%        -99%
        study_fail2   27307/s          0%          --        -48%        -56%        -64%        -99%
        index_fail1   52608/s         93%         93%          --        -15%        -31%        -98%
        regex_fail2   61837/s        126%        126%         18%          --        -19%        -98%
        regex_fail1   75918/s        178%        178%         44%         23%          --        -98%
        study_fail1 3412032/s      12396%      12395%       6386%       5418%       4394%          --
        Note that with this data, index is slower than a regex. The speed of 'study_fail1' is explained by the fact that the string we are looking for, 'foo12', contains letters not present in the string; since a studied string has a histogram of its letter frequencies attached, no searching needs to be performed at all.

        Your benchmark seems quite confused and, well, anything but useful.

        Your distinction between a short and long string is useless and deceptive because there are matches near both ends of the test strings for all of the searches you run, and there are no non-match data sets at all. index( $foo, $bar, 0 ) is no different from index( $foo, $bar ), neither of which does the same as /^$bar/, just like rindex( $foo, $bar ) is something entirely different from /$bar$/. You are comparing apples and meteors.
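        To make the mismatch concrete, here is a small sketch (sample data and variable names are made up for illustration) of what each call actually answers, and of index-based tests that would be fair counterparts to the anchored patterns.

```perl
use strict;
use warnings;

# Hypothetical data: the needle occurs twice, and not at the start.
my $foo = "this is a short string, not a long string";
my $bar = "string";

# The third argument defaults to 0, so these two calls are the same search.
my $first = index($foo, $bar);
my $same  = index($foo, $bar, 0);

# rindex() reports the offset of the LAST occurrence anywhere in the string.
my $last = rindex($foo, $bar);

# /^$bar/ asks "does $foo START with $bar?"; the index counterpart is "== 0".
my $starts_regex = $foo =~ /^\Q$bar\E/ ? 1 : 0;
my $starts_index = index($foo, $bar) == 0 ? 1 : 0;

# /$bar$/ asks "does $foo END with $bar?" (a trailing newline is also tolerated
# by $); the rindex counterpart compares the offset with the expected tail position.
my $tail_at    = length($foo) - length($bar);
my $ends_regex = $foo =~ /\Q$bar\E$/ ? 1 : 0;
my $ends_index = ($tail_at >= 0 && rindex($foo, $bar) == $tail_at) ? 1 : 0;

print "first=$first same=$same last=$last\n";
print "starts: regex=$starts_regex index=$starts_index\n";
print "ends:   regex=$ends_regex index=$ends_index\n";
```

        A bare index or rindex call only says where a match is, so timing it against an anchored pattern compares different questions.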

        Note that putting multiple different benchmarks in a single table only serves the purpose of casting further confusion onto the data. You are benchmarking three things; run three benchmarks, look at three tables.

        Add use re 'debug'; to a script and compile a few regexen sometime; you'll see that the regex engine turns the trivial cases into pretty much a plain index. There's more fixed overhead for invoking the engine rather than just calling that function, of course, but on large data sets that's negligible.
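        A minimal way to try this, assuming nothing beyond the core re pragma (the test string is borrowed from the first benchmark; the block scoping is just one way to limit the noise):

```perl
use strict;
use warnings;

my $text = "this is a string" x 300;

my $found;
{
    # The pragma is lexically scoped: only regexes compiled inside this
    # block are traced. The compile and match reports go to STDERR.
    use re 'debug';
    $found = $text =~ /string/ ? 1 : 0;
}

print "found: $found\n";
```

        The exact report format varies between perl versions, so it is best read for what the optimizer says about the fixed substring rather than compared line by line.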

        Makeshifts last the longest.