hazylife has asked for the wisdom of the Perl Monks concerning the following question:

So I'm working on a Perl version of grep for FB2 (XML based) ebooks, and just like grep it has a command line option that turns on case insensitive matching:

$re = $opts{i} ? qr/$pattern/i : qr/$pattern/;

Everything seems to be in place and working fine except for one major setback: matching a chunk of multibyte text against a pattern compiled with qr//i turns out to be 5-15 times slower than doing the same thing with m//i and just plain qr//, whereas one would expect the two approaches to be equally fast.

[ hmm, I'm a bit worried about the integrity of the UTF-8 (cyrillic) test string in this ]

#!/usr/bin/perl
use strict;
use utf8;
use Benchmark ':all';

# the pattern doesn't even need to contain anything fancy
# for the problem to manifest itself
my $pattern = 'dumbest pattern ever';

# it's all about whether or not the /i flag is embedded into the regex
my $re   = qr/$pattern/;
my $re_i = qr/$pattern/i;

# qr//i causes a noticeable slowdown even when dealing with 7-bit (US-ASCII)
# strings, but this being multibyte seems to make things _a lot_ worse
my $str = 'очень длинная строка ' x 10;

my $count = 300_000;

cmpthese($count, {
    'qr//i+m//' => sub { $str =~ /$re_i/ },
    'qr//+m//i' => sub { $str =~ /$re/i },
    'qr//+m//'  => sub { $str =~ /$re/ }
});

$ ./qr-utf8.pl
              Rate qr//i+m// qr//+m//i  qr//+m//
qr//i+m//  10881/s        --      -98%      -98%
qr//+m//i 505263/s     4543%        --       -7%
qr//+m//  540845/s     4870%        7%        --

One possible way around this would be to altogether abandon qr//i and instead eval() my matching subroutine with all the necessary /i flags textually inlined (there are two m//'s and one s///), but that's still quite ugly. Any suggestions?
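
For reference, the eval() idea described above could look roughly like the sketch below. The %opts hash and the sample input are hypothetical stand-ins, and only a single matcher is shown rather than the two m//'s and one s/// mentioned. Note that this only changes where the flag is written; it is not meant as a claim that matching gets any faster.

```perl
use strict;
use warnings;

# Hypothetical stand-ins for the real option hash and user-supplied pattern.
my %opts    = ( i => 1 );
my $pattern = 'dumbest pattern ever';

# Only a fixed, program-chosen flag string is spliced into the source text;
# $pattern stays a variable inside the generated code and is captured by
# the closure instead of being pasted into the eval string.
my $flags = $opts{i} ? 'i' : '';
my $match = eval "sub { \$_[0] =~ /\$pattern/$flags }"
    or die "could not compile matcher: $@";

print $match->('The DUMBEST Pattern Ever') ? "match\n" : "no match\n";
```

Keeping the user's pattern out of the eval string means a pattern containing a slash or Perl code cannot change the generated source.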

Replies are listed 'Best First'.
Re: qr//i versus m//i
by dave_the_m (Monsignor) on Feb 21, 2014 at 21:33 UTC
    /$qr/i isn't doing what you think it is. It's not doing a case-insensitive match. The '/i' doesn't override how the pattern has already been compiled. This is why it appears about as fast as the qr//+m// variant.
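
    A small sketch of the point above, using a made-up pattern and input: the flags live inside the compiled qr// object, so a trailing /i on the enclosing match never reaches it.

    ```perl
    use strict;
    use warnings;

    my $re   = qr/abc/;     # compiled case-sensitively
    my $re_i = qr/abc/i;    # compiled case-insensitively

    # The outer /i applies only to literal text in the enclosing pattern,
    # not to the already-compiled $re that is interpolated into it.
    my $outer_i = 'ABC' =~ /$re/i  ? 1 : 0;
    my $inner_i = 'ABC' =~ /$re_i/ ? 1 : 0;

    print "qr// then m//i: ", ( $outer_i ? "match" : "no match" ), "\n";
    print "qr//i then m//: ", ( $inner_i ? "match" : "no match" ), "\n";

    # Stringifying the object shows the flag group it carries with it.
    print "stringified:    $re\n";
    ```

    The exact stringified form depends on the perl version, but in each case it is a group that pins the flags for everything inside it.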

    Note also that case-insensitive matching is always going to be much slower than case-sensitive matching, especially when UNICODE is involved. And in particular, case-sensitive matching of fixed strings, such as in your benchmark, is specifically optimised (the main regex engine isn't actually called; a Boyer-Moore matcher is called instead). Which is why your benchmark makes the case-insensitive match look particularly bad.

    Dave.

      > The '/i' doesn't override how the pattern has already been compiled.

      Oh, I see. You're right Dave, my bad.

      > Note also that case-insensitive matching is always going to be much slower than case-sensitive matching, especially when UNICODE is involved.

      It really is awfully slow:

      $ RE=... # an actual regex here
      $ time fbgrep.pl "$RE" *fb2* >/dev/null
      real 0m32.912s
      user 0m32.374s
      sys 0m0.483s
      $ time fbgrep.pl -i "$RE" *fb2* >/dev/null
      real 2m17.575s
      user 2m16.359s
      sys 0m0.421s

      But I suppose there's not much to be done about it.

      Thanks for pointing me in the right direction!

Re: qr//i versus m//i
by Anonymous Monk on Feb 21, 2014 at 19:10 UTC

    Any suggestions?

    Don't worry about it (ignore the benchmark)

    Look into the m{(?i)$regex} option, and see how ack does it internally.
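
    A minimal sketch of the (?i) approach, assuming the pattern arrives as a plain string (as it would from the command line); $want_i is a hypothetical stand-in for the -i option. The second half shows the caveat that a (?i) placed in front of an already-compiled qr// object does not apply inside it.

    ```perl
    use strict;
    use warnings;

    my $pattern = 'abc';                   # plain string, not a qr// object
    my $want_i  = 1;                       # hypothetical stand-in for -i
    my $prefix  = $want_i ? '(?i)' : '';
    my $re      = qr/$prefix$pattern/;     # flag compiled into the regex once

    my $from_string = 'xABCx' =~ $re ? 1 : 0;

    # A qr// object carries its own flag group, so a leading (?i)
    # has no effect on the text inside it.
    my $compiled = qr/abc/;
    my $from_qr  = 'xABCx' =~ /(?i)$compiled/ ? 1 : 0;

    print "string pattern with (?i): ", ( $from_string ? "match" : "no match" ), "\n";
    print "qr// object with (?i):    ", ( $from_qr     ? "match" : "no match" ), "\n";
    ```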

      See also Re: Multiple Regex evaluations or one big one?

      Also if you use

      my $str1 = my $str2 = my $str3 = my $str4 = my $str = 'очень длинная строка ' x 10;
      ...
      'qr//i+m//' => q{ $str1 =~ /$re_i/; return; },
      'qr//+m//i' => q{ $str2 =~ /$re/i; return; },
      'qr//+m//'  => q{ $str3 =~ /$re/; return; },
      'qr'        => q{ $str4 =~ $re; return; },

      You get (warning: too few iterations for a reliable count)

      If you set my $count = -3; you get: Timing is consistently zero in estimation loop, cannot benchmark. N=134217728

      What does this mean? Don't worry about it :)

        Hah, another old benchmark I had lying around:

        #!/usr/bin/perl --
        use strict;
        use warnings;
        use Benchmark 'cmpthese';

        my @small = 1;
        cmpthese( 500_000, {
            '__' => sub {
                for my $line ( @small ){
                    next if $line !~ /\s+(-?\d+\.\d+)\s+(-?\d+\.\d+)/;
                }
                return;
            },
            '/_'.$].'_/' => eval ' sub {
                for my $line ( @small ){
                    next if $line !~ /'.
                        qr{\s+(-?\d+\.\d+)\s+(-?\d+\.\d+)} # 5.12.2 (?-xism:
                                                           # 5.14.1 (?^:
                    .'/;
                }
                return;
            } ',
            'qr' => sub {
                my $re = qr{\s+(-?\d+\.\d+)\s+(-?\d+\.\d+)};
                for my $line ( @small ){
                    next if $line !~ $re;
                }
                return;
            },
            '/qr/' => sub {
                my $re = qr{\s+(-?\d+\.\d+)\s+(-?\d+\.\d+)};
                for my $line ( @small ){
                    next if $line !~ /$re/;
                }
                return;
            },
            ## 2014-02-21-11:33:11
            'qr,//o' => do {
                our $gre = qr{\s+(-?\d+\.\d+)\s+(-?\d+\.\d+)};
                sub {
                    for my $line ( @small ){
                        next if $line !~ /$gre/;
                    }
                    return;
                };
            },
        });
        __END__
                         Rate  /qr/    qr qr,//o    __ /_5.016001_/
        /qr/         198807/s    --   -0%   -48%  -71%         -75%
        qr           198807/s    0%    --   -48%  -71%         -75%
        qr,//o       385505/s   94%   94%     --  -43%         -52%
        __           680272/s  242%  242%    76%    --         -15%
        /_5.016001_/ 800000/s  302%  302%   108%   18%           --