pbijnens has asked for the wisdom of the Perl Monks concerning the following question:

I have this very strange problem.

In some cases this substition is very slow (several seconds on a short string!)

s{^\x{feff}*<!-- HINT.*? -->}{};
Only the "some cases" seems very arbitrary, and I want to understand why.

Here is some code and below the benchmark results.

#!/usr/bin/perl use strict; use warnings; use Encode; use Benchmark qw(:all); binmode(STDOUT, ":utf8"); #my $xmlcmt = "<!-- HINT: Some XML comments to be deleted -->"; my $xmlcmt = ""; my $str_lat = "My \x{20AC}0.02 of info"; my $str_ara = decode_utf8("\xd8\xa7\xd9\x84\xd8\xa3\xd9\x86\xd8\xb4\xd +8\xb7\xd8\xa9"); print Encode::is_utf8($str_lat) ? "Is_utf8" : "Not_utf8", "\t"; print + $str_lat, "\n"; print Encode::is_utf8($str_ara) ? "Is_utf8" : "Not_utf8", "\t"; print + $str_ara, "\n"; timethese(2, { "ara nocmt 1step" => sub { s1($str_ara) }, "ara nocmt 2step" => sub { s2($str_ara) }, "ara w-cmt 1step" => sub { s1($xmlcmt . $str_ara) }, "ara w-cmt 2step" => sub { s2($xmlcmt . $str_ara) }, "lat nocmt 1step" => sub { s1($str_lat) }, "lat nocmt 2step" => sub { s2($str_lat) }, "lat w-cmt 1step" => sub { s1($xmlcmt . $str_lat) }, "lat w-cmt 2step" => sub { s2($xmlcmt . $str_lat) }, }); # Do the replacement in 1 step sub s1 { my $s = shift; #$s =~ s{^\x{feff}*<!-- .*? -->}{}; # This is fast #$s =~ s{^\x{feff}*<!-- Hi.*? -->}{}; # This is fast #$s =~ s{^\x{feff}?<!-- HINT.*? -->}{}; # This is fast $s =~ s{^\x{feff}*<!-- HINT.*? -->}{}; # This could be slow } # Do the replacement in 2 steps sub s2 { my $s = shift; $s =~ s{^\x{feff}*}{}; $s =~ s{^<!-- HINT.*? -->}{}; }
Here are the benchmarking results (on CentOS 5.6: "This is perl, v5.8.8 built for x86_64-linux-thread-multi"):
ara nocmt 1step:  5 wallclock secs ( 4.72 usr +  0.04 sys =  4.76 CPU) @  0.42/s (n=2)
ara nocmt 2step:  0 wallclock secs ( 0.00 usr +  0.00 sys =  0.00 CPU)
ara w-cmt 1step:  5 wallclock secs ( 4.75 usr +  0.00 sys =  4.75 CPU) @  0.42/s (n=2)
ara w-cmt 2step:  0 wallclock secs ( 0.00 usr +  0.00 sys =  0.00 CPU)
lat nocmt 1step:  0 wallclock secs ( 0.00 usr +  0.00 sys =  0.00 CPU)
lat nocmt 2step:  0 wallclock secs ( 0.00 usr +  0.00 sys =  0.00 CPU)
lat w-cmt 1step:  0 wallclock secs ( 0.00 usr +  0.00 sys =  0.00 CPU)
lat w-cmt 2step:  0 wallclock secs ( 0.00 usr +  0.00 sys =  0.00 CPU)

Notice how the first and the third test take almost 5 seconds for 2 iterations only.

When I change arbitrary conditions, the substition is fast again:

What exactly makes this slow?

I really like to understand what is going on here.

Replies are listed 'Best First'.
Re: Very slow regex substitution on Unicode string
by tchrist (Pilgrim) on May 25, 2011 at 19:40 UTC
    What’s going on here is that you’ve hit an old 5.8 bug from more than five years ago. If you change the first argument to timethese to 0 so that it calculates them for you, here are the results:
    % perl5.8 ~/906678.pl
    Is_utf8 My €0.02 of info
    Is_utf8 الأنشطة
    Benchmark: running ara nocmt 1step, ara nocmt 2step, ara w-cmt 1step, ara w-cmt 2step, lat nocmt 1step, lat nocmt 2step, lat w-cmt 1step, lat w-cmt 2step for at least 3 CPU seconds...
    ara nocmt 1step:  4 wallclock secs ( 3.58 usr +  0.01 sys =  3.59 CPU) @  0.56/s (n=2)
                (warning: too few iterations for a reliable count)
    ara nocmt 2step:  3 wallclock secs ( 3.14 usr +  0.01 sys =  3.15 CPU) @ 983039.68/s (n=3096575)
    ara w-cmt 1step:  4 wallclock secs ( 3.58 usr +  0.01 sys =  3.59 CPU) @  0.56/s (n=2)
                (warning: too few iterations for a reliable count)
    ara w-cmt 2step:  3 wallclock secs ( 3.11 usr +  0.01 sys =  3.12 CPU) @ 840413.46/s (n=2622090)
    lat nocmt 1step:  4 wallclock secs ( 3.13 usr +  0.02 sys =  3.15 CPU) @ 1664822.54/s (n=5244191)
    lat nocmt 2step:  3 wallclock secs ( 3.12 usr +  0.01 sys =  3.13 CPU) @ 923365.81/s (n=2890135)
    lat w-cmt 1step:  4 wallclock secs ( 3.09 usr +  0.00 sys =  3.09 CPU) @ 1437420.39/s (n=4441629)
    lat w-cmt 2step:  4 wallclock secs ( 3.11 usr +  0.01 sys =  3.12 CPU) @ 789479.17/s (n=2463175)
    
    % perl5.10 ~/906678.pl
    Is_utf8 My €0.02 of info
    Is_utf8 الأنشطة
    Benchmark: running ara nocmt 1step, ara nocmt 2step, ara w-cmt 1step, ara w-cmt 2step, lat nocmt 1step, lat nocmt 2step, lat w-cmt 1step, lat w-cmt 2step for at least 3 CPU seconds...
    ara nocmt 1step:  3 wallclock secs ( 3.17 usr +  0.02 sys =  3.19 CPU) @ 1592576.80/s (n=5080320)
    ara nocmt 2step:  4 wallclock secs ( 3.16 usr +  0.00 sys =  3.16 CPU) @ 803844.94/s (n=2540150)
    ara w-cmt 1step:  3 wallclock secs ( 3.14 usr + -0.01 sys =  3.13 CPU) @ 1481335.46/s (n=4636580)
    ara w-cmt 2step:  4 wallclock secs ( 3.27 usr +  0.01 sys =  3.28 CPU) @ 695634.45/s (n=2281681)
    lat nocmt 1step:  3 wallclock secs ( 3.25 usr +  0.01 sys =  3.26 CPU) @ 1511156.44/s (n=4926370)
    lat nocmt 2step:  4 wallclock secs ( 3.20 usr +  0.01 sys =  3.21 CPU) @ 791326.79/s (n=2540159)
    lat w-cmt 1step:  3 wallclock secs ( 3.13 usr +  0.01 sys =  3.14 CPU) @ 1336099.36/s (n=4195352)
    lat w-cmt 2step:  3 wallclock secs ( 3.18 usr +  0.01 sys =  3.19 CPU) @ 686363.01/s (n=2189498)
    
    % perl5.12 ~/906678.pl
    Is_utf8 My €0.02 of info
    Is_utf8 الأنشطة
    Benchmark: running ara nocmt 1step, ara nocmt 2step, ara w-cmt 1step, ara w-cmt 2step, lat nocmt 1step, lat nocmt 2step, lat w-cmt 1step, lat w-cmt 2step for at least 3 CPU seconds...
    ara nocmt 1step:  4 wallclock secs ( 3.21 usr +  0.00 sys =  3.21 CPU) @ 1937381.31/s (n=6218994)
    ara nocmt 2step:  3 wallclock secs ( 3.11 usr +  0.01 sys =  3.12 CPU) @ 766259.94/s (n=2390731)
    ara w-cmt 1step:  3 wallclock secs ( 3.07 usr +  0.01 sys =  3.08 CPU) @ 1481614.61/s (n=4563373)
    ara w-cmt 2step:  3 wallclock secs ( 3.12 usr +  0.01 sys =  3.13 CPU) @ 670184.03/s (n=2097676)
    lat nocmt 1step:  2 wallclock secs ( 3.05 usr +  0.01 sys =  3.06 CPU) @ 1992276.47/s (n=6096366)
    lat nocmt 2step:  4 wallclock secs ( 3.08 usr +  0.01 sys =  3.09 CPU) @ 750254.05/s (n=2318285)
    lat w-cmt 1step:  3 wallclock secs ( 3.07 usr +  0.01 sys =  3.08 CPU) @ 1647204.22/s (n=5073389)
    lat w-cmt 2step:  3 wallclock secs ( 3.08 usr +  0.01 sys =  3.09 CPU) @ 678859.55/s (n=2097676)
    
    % perl5.14 ~/906678.pl
    Is_utf8 My €0.02 of info
    Is_utf8 الأنشطة
    Benchmark: running ara nocmt 1step, ara nocmt 2step, ara w-cmt 1step, ara w-cmt 2step, lat nocmt 1step, lat nocmt 2step, lat w-cmt 1step, lat w-cmt 2step for at least 3 CPU seconds...
    ara nocmt 1step:  4 wallclock secs ( 3.00 usr +  0.01 sys =  3.01 CPU) @ 2013724.92/s (n=6061312)
    ara nocmt 2step:  3 wallclock secs ( 3.26 usr +  0.01 sys =  3.27 CPU) @ 828593.88/s (n=2709502)
    ara w-cmt 1step:  4 wallclock secs ( 3.16 usr +  0.00 sys =  3.16 CPU) @ 1714874.68/s (n=5419004)
    ara w-cmt 2step:  4 wallclock secs ( 3.32 usr +  0.01 sys =  3.33 CPU) @ 678051.05/s (n=2257910)
    lat nocmt 1step:  2 wallclock secs ( 3.08 usr +  0.01 sys =  3.09 CPU) @ 2063201.94/s (n=6375294)
    lat nocmt 2step:  4 wallclock secs ( 3.30 usr + -0.01 sys =  3.29 CPU) @ 823556.84/s (n=2709502)
    lat w-cmt 1step:  3 wallclock secs ( 3.17 usr + -0.01 sys =  3.16 CPU) @ 1607695.89/s (n=5080319)
    lat w-cmt 2step:  4 wallclock secs ( 3.19 usr +  0.01 sys =  3.20 CPU) @ 684215.00/s (n=2189488)
    

    See? No more problem. Not only that, it’s getting faster, too.

Re: Very slow regex substitution on Unicode string
by petdance (Parson) on May 26, 2011 at 04:23 UTC
    In addition to what tchrist said, it's worth noting that the parsimonious matching can be much slower than other choices because of all the backtracking it has to do.

    I don't know what your data looks like, but if you can make limits on the regex based on what you know about the data, then that will help. For instance, in the HTML comments you're capturing HINT followed by almost anything. Does it need to be that liberal? Or are you really only looking for HINT followed by some non-whitespace? If you can change the HINT.*? part of your regex to HINT\S* you will get much faster times.

    Of course, that's just a guess on my part, but anything you can do to help the regex matcher limit its range of work, the better.

    xoxo,
    Andy

      There is not so much backtracking going on I think.

      I need to delete the whole xml-comment when it starts with the space+HINT. I think the fastest way is:

      s/<!-- HINT.*? -->//
      The ".*?" should do minimal matching, i.e. stopping as soon as the string-so-far is followed by the litteral space-dash-dash-greaterthan. This (should) makes it do no backtracking at all, I believe. I do not think I can make it faster, or am I wrong?

      Also the removal of the initial BOM should not lead to much backtracking. The regex engine should look at most at one character ahead in this case (I think).

      s/^\x{feff}*//
      And note that the whole regex was anchored at the beginning of the string too; again leading to much less work.

      I'll just wait for CentOS 6.x to come out, which has perl5.10, and until then just split up the regex. Good enough for me.

        You should get minimal backtracking. At the start of each word in the comment it will backtrack over the space between words.

        You should look into building your own up-to-date perl. I've found it to be remarkably easy on both Solaris & Linux boxes, and you get to use the latest & greatest perl for you own work. Just leave the system version where it is, and put your own on the #! line of your programs or put it earlier in your personal PATH.

        TJD

Re: Very slow regex substitution on Unicode string
by Anonymous Monk on May 25, 2011 at 19:54 UTC
    I'm getting ...
    Is_utf8 My €0.02 of info Is_utf8 &#1575;&#1604;&#1571;&#1606;&#1588;&#1591;&#1577; Benchmark: timing 2 iterations of ara nocmt 1step, ara nocmt 2step, ar +a w-cmt 1step, ara w-cmt 2step, lat nocmt 1step, lat nocmt 2step, lat + w-cmt 1step, lat w-cmt 2step... ara nocmt 1step: 0 wallclock secs ( 0.00 usr + 0.00 sys = 0.00 CPU) (warning: too few iterations for a reliable count) ara nocmt 2step: 0 wallclock secs ( 0.00 usr + 0.00 sys = 0.00 CPU) (warning: too few iterations for a reliable count) ara w-cmt 1step: 0 wallclock secs ( 0.00 usr + 0.00 sys = 0.00 CPU) (warning: too few iterations for a reliable count) ara w-cmt 2step: 0 wallclock secs ( 0.00 usr + 0.00 sys = 0.00 CPU) (warning: too few iterations for a reliable count) lat nocmt 1step: 0 wallclock secs ( 0.00 usr + 0.00 sys = 0.00 CPU) (warning: too few iterations for a reliable count) lat nocmt 2step: 0 wallclock secs ( 0.00 usr + 0.00 sys = 0.00 CPU) (warning: too few iterations for a reliable count) lat w-cmt 1step: 0 wallclock secs ( 0.00 usr + 0.00 sys = 0.00 CPU) (warning: too few iterations for a reliable count) lat w-cmt 2step: 0 wallclock secs ( 0.00 usr + 0.00 sys = 0.00 CPU) (warning: too few iterations for a reliable count)
    ... on my slightly out of date 5.12 install.

    The fine folks who make this lovely little language of ours have done a *lot* of work on the regex & unicode stuff since in both 5.10 & 5.12. Perhaps you're being bitten by a bug related to your perl version?

    TJD

      Bah, got in 2nd with an inferior reply.

      TJD