Re: Did the inefficiency of /i get fixed?

All these posts have used regexes that are too simple. Here's my benchmark (on 5.8.3):

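use strict;
use warnings;
use Benchmark 'cmpthese';

# Note: "x 100" puts no space between copies ("...worldHELLO world..."),
# so this string never contains " HELLO " and none of the *_complex
# patterns below can match; they all measure failed matches.
my $y = "HELLO world" x 100;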

cmpthese(-5, {
  cc_simple => sub {
    $y =~ /[hH][eE][lL][lL][oO]/;
  },
  cc_complex => sub {
    $y =~ /\w+ [hH][eE][lL][lL][oO] \w+/;
  },
  i_simple => sub {
    $y =~ /hello/i;
  },
  i_complex => sub {
    $y =~ /\w+ hello \w+/i;
  },
  simple => sub {
    $y =~ /HELLO/;
  },
  complex => sub {
    $y =~ /\w+ HELLO \w+/;
  },
  simple_fail => sub {
    $y =~ /hELLO/;
  },
  complex_fail => sub {
    $y =~ /\w+ hELLO \w+/;
  },
});

__END__
                  Rate cc_complex i_complex complex complex_fail simple_fail cc_simple i_simple simple
cc_complex     13973/s         --       -2%    -82%         -83%        -87%      -99%     -99%   -99%
i_complex      14271/s         2%        --    -81%         -83%        -87%      -99%     -99%   -99%
complex        77103/s       452%      440%      --          -7%        -31%      -93%     -93%   -94%
complex_fail   82560/s       491%      479%      7%           --        -26%      -92%     -93%   -94%
simple_fail   111335/s       697%      680%     44%          35%          --      -90%     -90%   -92%
cc_simple    1065985/s      7529%     7370%   1283%        1191%        857%        --      -5%   -23%
i_simple     1120558/s      7919%     7752%   1353%        1257%        906%        5%       --   -19%
simple       1377591/s      9759%     9553%   1687%        1569%       1137%       29%      23%     --

You can see that, for the very simple regex, the /i regex is slightly faster than the charclass regex (about 5%). That is because, even with case-insensitivity on, /hello/i is still a simple Boyer-Moore search, much like the plain /HELLO/ (compare "simple" and "i_simple"). Once you get into a complex regex, one that requires actual pattern matching, /i and charclasses perform about the same: 14271/s versus 13973/s (compare "i_complex" and "cc_complex").
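For anyone who wants to try the charclass form on other words, here is a minimal sketch of how such a pattern could be generated instead of typed by hand. The helper name charclass_for is hypothetical (it is not from the benchmark above), and it assumes plain ASCII input.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of the original benchmark): turn a literal
# ASCII word into the per-letter character-class form used by the cc_* cases.
sub charclass_for {
    my ($word) = @_;
    return join '', map {
        my ($lo, $up) = (lc $_, uc $_);
        # Letters become a two-character class; anything else is escaped.
        $lo ne $up ? "[$lo$up]" : quotemeta $_;
    } split //, $word;
}

my $pattern = charclass_for('hello');
print "$pattern\n";    # prints [hH][eE][lL][lL][oO]

# The generated pattern can be interpolated like any other regex source.
print "matched\n" if "HELLO world" =~ /$pattern/;
```

Swapping the generated pattern into the cc_simple and cc_complex subs should give the same regexes as the hand-written ones, so the timings ought to be comparable.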

_____________________________________________________
Jeff[japhy]Pinyan: Perl, regex, and perl hacker, who'd like a job (NYC-area)
s++=END;++y(;-P)}y js++=;shajsj<++y(p-q)}?print:??;

Replies are listed 'Best First'.

Re: Re: Did the inefficiency of /i get fixed?
by itub (Priest) on May 19, 2004 at 14:22 UTC

The inefficiency is real, and it matters a great deal if you are using Unicode strings, because lowercasing Unicode characters is slow. Consider this modified benchmark:

use utf8;
use Benchmark 'cmpthese';

my $y = "HELLO ΑΙΝΣΪ" x 100;
# make sure your script is encoded in UTF-8 when you save it!

# ... the rest of the code is the same as in the parent node

Results:

                 Rate i_complex cc_complex complex complex_fail i_simple simple_fail simple cc_simple
i_complex       672/s        --       -68%    -70%         -71%     -98%       -100%  -100%     -100%
cc_complex     2078/s      209%         --     -9%         -10%     -94%        -99%  -100%     -100%
complex        2278/s      239%        10%      --          -2%     -94%        -99%  -100%     -100%
complex_fail   2317/s      245%        12%      2%           --     -94%        -99%  -100%     -100%
i_simple      35860/s     5234%      1626%   1474%        1448%       --        -79%   -95%      -95%
simple_fail  168298/s    24935%      8001%   7289%        7164%     369%          --   -76%      -78%
simple       703114/s   104489%     33744%  30771%       30248%    1861%        318%     --      -10%
cc_simple    780570/s   116011%     37472%  34172%       33591%    2077%        364%    11%        --

Character classes are about 3 times faster than /i for the complex case (2078/s vs. 672/s) and about 22 times faster for the simple case (780570/s vs. 35860/s)!
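As a quick check on those ratios, the speedups can be recomputed directly from the Rate column of the table above. This is only a sketch of the arithmetic; the %rate hash and the speedup sub are names made up for illustration.

```perl
use strict;
use warnings;

# Rates (iterations per second) copied from the Unicode results table above.
my %rate = (
    i_complex  => 672,
    cc_complex => 2078,
    i_simple   => 35860,
    cc_simple  => 780570,
);

# Hypothetical helper: how many times faster one case is than another.
sub speedup {
    my ($fast, $slow) = @_;
    return $rate{$fast} / $rate{$slow};
}

printf "complex: %.1fx\n", speedup('cc_complex', 'i_complex');   # complex: 3.1x
printf "simple:  %.1fx\n", speedup('cc_simple',  'i_simple');    # simple:  21.8x
```

So the character-class form comes out roughly 3 times faster in the complex case and just under 22 times faster in the simple case.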
