qr//i versus m//i

hazylife has asked for the wisdom of the Perl Monks concerning the following question:

So I'm working on a Perl version of grep for FB2 (XML based) ebooks, and just like grep it has a command line option that turns on case insensitive matching:

$re = $opts{i} ? qr/$pattern/i : qr/$pattern/;

Everything seems to be in place and working fine except for one major setback: matching a chunk of multibyte text against a pattern compiled with qr//i turns out to be 5-15 times slower than doing the same thing with m//i and just plain qr//, whereas one would expect the two approaches to be equally fast.

[ hmm, I'm a bit worried about the integrity of the UTF-8 (cyrillic) test string in this ]

#!/usr/bin/perl

use strict;
use utf8;
use Benchmark ':all';

# the pattern doesn't even need to contain anything fancy
# for the problem to manifest itself
my $pattern = 'dumbest pattern ever';

# it's all about whether or not the /i flag is embedded into the regex
my $re   = qr/$pattern/;
my $re_i = qr/$pattern/i;

# qr//i causes a noticeable slowdown even when dealing with 7-bit (US-ASCII)
# strings, but this being multibyte seems to make things _a lot_ worse
my $str = 'очень длинная строка ' x 10;


my $count = 300_000;

cmpthese($count, {
    'qr//i+m//' => sub { $str =~ /$re_i/ },
    'qr//+m//i' => sub { $str =~ /$re/i },
    'qr//+m//'  => sub { $str =~ /$re/ }
});

$ ./qr-utf8.pl
              Rate qr//i+m// qr//+m//i  qr//+m//
qr//i+m//  10881/s        --      -98%      -98%
qr//+m//i 505263/s     4543%        --       -7%
qr//+m//  540845/s     4870%        7%        --

One possible way around this would be to altogether abandon qr//i and instead eval() my matching subroutine with all the necessary /i flags textually inlined (there are two m//'s and one s///), but that's still quite ugly. Any suggestions?


Replies are listed 'Best First'.
Re: qr//i versus m//i by dave_the_m (Monsignor) on Feb 21, 2014 at 21:33 UTC
`/$qr/i` isn't doing what you think it is. It's not doing a case-insensitive match. The '/i' doesn't override how the pattern has already been compiled. This is why it appears about as fast as the `qr//+m//` variant.

Note also that case-insensitive matching is always going to be much slower than case-sensitive matching, especially when UNICODE is involved. And in particular, case-sensitive matching of fixed strings, such as in your benchmark, is specifically optimised (the main regex engine isn't actually called; a Boyer-Moore matcher is called instead). Which is why your benchmark makes the case-insensitive match look particularly bad.

Dave.
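To make the point about flags concrete, here is a minimal sketch (the variable names and test strings are illustrative, not taken from fbgrep.pl). It shows that a trailing /i on the enclosing match leaves an interpolated qr// object untouched, while still applying to literal text written next to it:

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $re   = qr/abc/;     # compiled case-sensitively
my $re_i = qr/abc/i;    # compiled case-insensitively

# The outer /i does not reach inside the already-compiled $re.
my $outer_i = ( 'ABC' =~ /$re/i ) ? 1 : 0;

# The flag baked into qr//i travels with the object.
my $inner_i = ( 'ABC' =~ /$re_i/ ) ? 1 : 0;

# The outer /i still governs the literal "x", but not the embedded $re.
my $mixed_lower = ( 'Xabc' =~ /x$re/i ) ? 1 : 0;
my $mixed_upper = ( 'XABC' =~ /x$re/i ) ? 1 : 0;

print "outer /i on qr//     : $outer_i\n";
print "qr//i                : $inner_i\n";
print "literal + qr// (abc) : $mixed_lower\n";
print "literal + qr// (ABC) : $mixed_upper\n";
```

So in the original benchmark the 'qr//+m//i' case is really measuring a case-sensitive match, which is why it lands next to 'qr//+m//'.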
Re^2: qr//i versus m//i by hazylife (Monk) on Feb 22, 2014 at 12:26 UTC
> The '/i' doesn't override how the pattern has already been compiled.

Oh, I see. You're right Dave, my bad.

> Note also that case-insensitive matching is always going to be much slower than case-sensitive matching, especially when UNICODE is involved.

It really is awfully slow:

$ RE=... # an actual regex here
$ time fbgrep.pl "$RE" fb2 >/dev/null

real    0m32.912s
user    0m32.374s
sys     0m0.483s
$ time fbgrep.pl -i "$RE" fb2 >/dev/null

real    2m17.575s
user    2m16.359s
sys     0m0.421s

But I suppose there's not much to be done about it. Thanks for pointing me in the right direction!
Re: qr//i versus m//i by Anonymous Monk on Feb 21, 2014 at 19:10 UTC
> Any suggestions?

Don't worry about it (ignore the benchmark). See the m{(?i)$regex} option, and look inside ack.
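A rough sketch of the (?i) idea, assuming the pattern arrives as a plain string from the command line. The helper name compile_pattern is hypothetical and not part of fbgrep.pl or ack. Prefixing the string with (?i) before compiling has the same effect as qr//i, without needing an eval:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: compile a pattern string, optionally case-insensitive.
# $pattern must be a plain string; a qr// object passed here would keep the
# flags it was compiled with, and the (?i) prefix would not reach inside it.
sub compile_pattern {
    my ( $pattern, $ignore_case ) = @_;
    return $ignore_case ? qr/(?i)$pattern/ : qr/$pattern/;
}

my $plain  = compile_pattern( 'needle', 0 );
my $nocase = compile_pattern( 'needle', 1 );

my $text = 'a NEEDLE in a haystack';

my $plain_hit  = ( $text =~ $plain )  ? 1 : 0;
my $nocase_hit = ( $text =~ $nocase ) ? 1 : 0;

print "case-sensitive  : $plain_hit\n";
print "case-insensitive: $nocase_hit\n";
```

This only changes how the flag is spelled; the case-insensitive match itself will be as slow as it is with qr//i.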
Re^2: qr//i versus m//i ( qr slow ) by Anonymous Monk on Feb 21, 2014 at 19:38 UTC
See also Re: Multiple Regex evaluations or one big one?

Also, if you use

my $str1 = my $str2 = my $str3 = my $str4 = my $str = 'очень длинная строка ' x 10;
...
    'qr//i+m//' => q{ $str1 =~ /$re_i/; return; },
    'qr//+m//i' => q{ $str2 =~ /$re/i; return; },
    'qr//+m//'  => q{ $str3 =~ /$re/; return; },
    'qr'        => q{ $str4 =~ $re; return; },

you get

(warning: too few iterations for a reliable count)

If you set my $count = -3; you get

Timing is consistently zero in estimation loop, cannot benchmark. N=134217728

What does this mean? Don't worry about it :)
Re^3: qr//i versus m//i ( benchmark.qr.versus.inline.pl eval ) by Anonymous Monk on Feb 21, 2014 at 19:46 UTC
Hah, another old benchmark I had lying around:

#!/usr/bin/perl --
use strict;
use warnings;
use Benchmark 'cmpthese';

my @small = 1;

cmpthese( 500_000, {
    '__' => sub {
        for my $line ( @small ){
            next if $line !~ /\s+(-?\d+\.\d+)\s+(-?\d+\.\d+)/;
        }
        return;
    },
    '/_'.$].'_/' => eval ' sub {
        for my $line ( @small ){
            next if $line !~ /'. qr{\s+(-?\d+\.\d+)\s+(-?\d+\.\d+)}
            # 5.12.2 (?-xism:
            # 5.14.1 (?^:
            .'/;
        }
        return;
    } ',
    'qr' => sub {
        my $re = qr{\s+(-?\d+\.\d+)\s+(-?\d+\.\d+)};
        for my $line ( @small ){
            next if $line !~ $re;
        }
        return;
    },
    '/qr/' => sub {
        my $re = qr{\s+(-?\d+\.\d+)\s+(-?\d+\.\d+)};
        for my $line ( @small ){
            next if $line !~ /$re/;
        }
        return;
    },
    ## 2014-02-21-11:33:11
    'qr,//o' => do {
        our $gre = qr{\s+(-?\d+\.\d+)\s+(-?\d+\.\d+)};
        sub {
            for my $line ( @small ){
                next if $line !~ /$gre/;
            }
            return;
        };
    },
});
__END__
                 Rate  /qr/    qr qr,//o    __ /_5.016001_/
/qr/         198807/s    --   -0%   -48%  -71%         -75%
qr           198807/s    0%    --   -48%  -71%         -75%
qr,//o       385505/s   94%   94%     --  -43%         -52%
__           680272/s  242%  242%    76%    --         -15%
/_5.016001_/ 800000/s  302%  302%   108%   18%           --