Very slow regex substitution on Unicode string

pbijnens has asked for the wisdom of the Perl Monks concerning the following question:

I have this very strange problem.

In some cases this substition is very slow (several seconds on a short string!)

   s{^\x{feff}*<!-- HINT.*? -->}{};
[download]

Only the "some cases" seems very arbitrary, and I want to understand why.

Here is some code and below the benchmark results.

#!/usr/bin/perl

use strict;
use warnings;

use Encode;
use Benchmark qw(:all);

binmode(STDOUT, ":utf8");

#my $xmlcmt = "<!-- HINT: Some XML comments to be deleted -->";
my $xmlcmt = ""; 
my $str_lat = "My \x{20AC}0.02 of info";
my $str_ara = decode_utf8("\xd8\xa7\xd9\x84\xd8\xa3\xd9\x86\xd8\xb4\xd
+8\xb7\xd8\xa9");


print Encode::is_utf8($str_lat) ? "Is_utf8" : "Not_utf8", "\t";  print
+ $str_lat, "\n";
print Encode::is_utf8($str_ara) ? "Is_utf8" : "Not_utf8", "\t";  print
+ $str_ara, "\n";

timethese(2, {
    "ara nocmt 1step" => sub { s1($str_ara) },
    "ara nocmt 2step" => sub { s2($str_ara) },
    "ara w-cmt 1step" => sub { s1($xmlcmt . $str_ara) },
    "ara w-cmt 2step" => sub { s2($xmlcmt . $str_ara) },
    "lat nocmt 1step" => sub { s1($str_lat) },
    "lat nocmt 2step" => sub { s2($str_lat) },
    "lat w-cmt 1step" => sub { s1($xmlcmt . $str_lat) },
    "lat w-cmt 2step" => sub { s2($xmlcmt . $str_lat) },
 
});

# Do the replacement in 1 step
sub s1
{
    my $s = shift;
    #$s =~ s{^\x{feff}*<!-- .*? -->}{};         # This is fast
    #$s =~ s{^\x{feff}*<!-- Hi.*? -->}{};       # This is fast
    #$s =~ s{^\x{feff}?<!-- HINT.*? -->}{};     # This is fast
    $s =~ s{^\x{feff}*<!-- HINT.*? -->}{};      # This could be slow
}

# Do the replacement in 2 steps
sub s2
{
    my $s = shift;
    $s =~ s{^\x{feff}*}{};
    $s =~ s{^<!-- HINT.*? -->}{};
}
[download]

Here are the benchmarking results (on CentOS 5.6: "This is perl, v5.8.8 built for x86_64-linux-thread-multi"):

ara nocmt 1step:  5 wallclock secs ( 4.72 usr +  0.04 sys =  4.76 CPU) @  0.42/s (n=2)
ara nocmt 2step:  0 wallclock secs ( 0.00 usr +  0.00 sys =  0.00 CPU)
ara w-cmt 1step:  5 wallclock secs ( 4.75 usr +  0.00 sys =  4.75 CPU) @  0.42/s (n=2)
ara w-cmt 2step:  0 wallclock secs ( 0.00 usr +  0.00 sys =  0.00 CPU)
lat nocmt 1step:  0 wallclock secs ( 0.00 usr +  0.00 sys =  0.00 CPU)
lat nocmt 2step:  0 wallclock secs ( 0.00 usr +  0.00 sys =  0.00 CPU)
lat w-cmt 1step:  0 wallclock secs ( 0.00 usr +  0.00 sys =  0.00 CPU)
lat w-cmt 2step:  0 wallclock secs ( 0.00 usr +  0.00 sys =  0.00 CPU)

Notice how the first and the third test take almost 5 seconds for 2 iterations only.

When I change arbitrary conditions, the substition is fast again:

Splitting the substition into two steps makes it fast, as the benchmark code above shows.
Changing the "\x{feff}*" (zero-or-more) into "\x{feff}?" (zero-or-one) makes it fast.
Change the word "HINT" into someting else, and it's fast again (lowercase "hint" is still slow).
Changing the Arabic source string could make it faster (some arabic strings are fast, some are slow -- I did not yet find a latin-script string that is slow).
similar as previous item: really adding someting to delete, and thereby changing the Arabic string, makes it fast. (use the uncommented $xmlcmt in the benchmark code)

What exactly makes this slow?

Is there something wrong with the Arabic string? (I don't understand a single letter of it, I'm only the programmer handling the files.)
Is it the unicode nature of the string or regex and some bugs in perl 5.8.8 on CentOS 5.6?
Is it reproducable on other CentOS 5.6 machines (not maintained by me :-) )?

I really like to understand what is going on here.

Comment on Very slow regex substitution on Unicode string Select or Download Code

Replies are listed 'Best First'.
Re: Very slow regex substitution on Unicode string by tchrist (Pilgrim) on May 25, 2011 at 19:40 UTC
What’s going on here is that you’ve hit an old 5.8 bug from more than five years ago. If you change the first argument to `timethese` to `0` so that it calculates them for you, here are the results: % perl5.8 ~/906678.pl Is_utf8 My €0.02 of info Is_utf8 الأنشطة Benchmark: running ara nocmt 1step, ara nocmt 2step, ara w-cmt 1step, ara w-cmt 2step, lat nocmt 1step, lat nocmt 2step, lat w-cmt 1step, lat w-cmt 2step for at least 3 CPU seconds... ara nocmt 1step: 4 wallclock secs ( 3.58 usr + 0.01 sys = 3.59 CPU) @ 0.56/s (n=2) (warning: too few iterations for a reliable count) ara nocmt 2step: 3 wallclock secs ( 3.14 usr + 0.01 sys = 3.15 CPU) @ 983039.68/s (n=3096575) ara w-cmt 1step: 4 wallclock secs ( 3.58 usr + 0.01 sys = 3.59 CPU) @ 0.56/s (n=2) (warning: too few iterations for a reliable count) ara w-cmt 2step: 3 wallclock secs ( 3.11 usr + 0.01 sys = 3.12 CPU) @ 840413.46/s (n=2622090) lat nocmt 1step: 4 wallclock secs ( 3.13 usr + 0.02 sys = 3.15 CPU) @ 1664822.54/s (n=5244191) lat nocmt 2step: 3 wallclock secs ( 3.12 usr + 0.01 sys = 3.13 CPU) @ 923365.81/s (n=2890135) lat w-cmt 1step: 4 wallclock secs ( 3.09 usr + 0.00 sys = 3.09 CPU) @ 1437420.39/s (n=4441629) lat w-cmt 2step: 4 wallclock secs ( 3.11 usr + 0.01 sys = 3.12 CPU) @ 789479.17/s (n=2463175) % perl5.10 ~/906678.pl Is_utf8 My €0.02 of info Is_utf8 الأنشطة Benchmark: running ara nocmt 1step, ara nocmt 2step, ara w-cmt 1step, ara w-cmt 2step, lat nocmt 1step, lat nocmt 2step, lat w-cmt 1step, lat w-cmt 2step for at least 3 CPU seconds... ara nocmt 1step: 3 wallclock secs ( 3.17 usr + 0.02 sys = 3.19 CPU) @ 1592576.80/s (n=5080320) ara nocmt 2step: 4 wallclock secs ( 3.16 usr + 0.00 sys = 3.16 CPU) @ 803844.94/s (n=2540150) ara w-cmt 1step: 3 wallclock secs ( 3.14 usr + -0.01 sys = 3.13 CPU) @ 1481335.46/s (n=4636580) ara w-cmt 2step: 4 wallclock secs ( 3.27 usr + 0.01 sys = 3.28 CPU) @ 695634.45/s (n=2281681) lat nocmt 1step: 3 wallclock secs ( 3.25 usr + 0.01 sys = 3.26 CPU) @ 1511156.44/s (n=4926370) lat nocmt 2step: 4 wallclock secs ( 3.20 usr + 0.01 sys = 3.21 CPU) @ 791326.79/s (n=2540159) lat w-cmt 1step: 3 wallclock secs ( 3.13 usr + 0.01 sys = 3.14 CPU) @ 1336099.36/s (n=4195352) lat w-cmt 2step: 3 wallclock secs ( 3.18 usr + 0.01 sys = 3.19 CPU) @ 686363.01/s (n=2189498) % perl5.12 ~/906678.pl Is_utf8 My €0.02 of info Is_utf8 الأنشطة Benchmark: running ara nocmt 1step, ara nocmt 2step, ara w-cmt 1step, ara w-cmt 2step, lat nocmt 1step, lat nocmt 2step, lat w-cmt 1step, lat w-cmt 2step for at least 3 CPU seconds... ara nocmt 1step: 4 wallclock secs ( 3.21 usr + 0.00 sys = 3.21 CPU) @ 1937381.31/s (n=6218994) ara nocmt 2step: 3 wallclock secs ( 3.11 usr + 0.01 sys = 3.12 CPU) @ 766259.94/s (n=2390731) ara w-cmt 1step: 3 wallclock secs ( 3.07 usr + 0.01 sys = 3.08 CPU) @ 1481614.61/s (n=4563373) ara w-cmt 2step: 3 wallclock secs ( 3.12 usr + 0.01 sys = 3.13 CPU) @ 670184.03/s (n=2097676) lat nocmt 1step: 2 wallclock secs ( 3.05 usr + 0.01 sys = 3.06 CPU) @ 1992276.47/s (n=6096366) lat nocmt 2step: 4 wallclock secs ( 3.08 usr + 0.01 sys = 3.09 CPU) @ 750254.05/s (n=2318285) lat w-cmt 1step: 3 wallclock secs ( 3.07 usr + 0.01 sys = 3.08 CPU) @ 1647204.22/s (n=5073389) lat w-cmt 2step: 3 wallclock secs ( 3.08 usr + 0.01 sys = 3.09 CPU) @ 678859.55/s (n=2097676) % perl5.14 ~/906678.pl Is_utf8 My €0.02 of info Is_utf8 الأنشطة Benchmark: running ara nocmt 1step, ara nocmt 2step, ara w-cmt 1step, ara w-cmt 2step, lat nocmt 1step, lat nocmt 2step, lat w-cmt 1step, lat w-cmt 2step for at least 3 CPU seconds... ara nocmt 1step: 4 wallclock secs ( 3.00 usr + 0.01 sys = 3.01 CPU) @ 2013724.92/s (n=6061312) ara nocmt 2step: 3 wallclock secs ( 3.26 usr + 0.01 sys = 3.27 CPU) @ 828593.88/s (n=2709502) ara w-cmt 1step: 4 wallclock secs ( 3.16 usr + 0.00 sys = 3.16 CPU) @ 1714874.68/s (n=5419004) ara w-cmt 2step: 4 wallclock secs ( 3.32 usr + 0.01 sys = 3.33 CPU) @ 678051.05/s (n=2257910) lat nocmt 1step: 2 wallclock secs ( 3.08 usr + 0.01 sys = 3.09 CPU) @ 2063201.94/s (n=6375294) lat nocmt 2step: 4 wallclock secs ( 3.30 usr + -0.01 sys = 3.29 CPU) @ 823556.84/s (n=2709502) lat w-cmt 1step: 3 wallclock secs ( 3.17 usr + -0.01 sys = 3.16 CPU) @ 1607695.89/s (n=5080319) lat w-cmt 2step: 4 wallclock secs ( 3.19 usr + 0.01 sys = 3.20 CPU) @ 684215.00/s (n=2189488) See? No more problem. Not only that, it’s getting faster, too.	[reply] [d/l]
Re: Very slow regex substitution on Unicode string by petdance (Parson) on May 26, 2011 at 04:23 UTC
In addition to what tchrist said, it's worth noting that the parsimonious matching can be much slower than other choices because of all the backtracking it has to do. I don't know what your data looks like, but if you can make limits on the regex based on what you know about the data, then that will help. For instance, in the HTML comments you're capturing HINT followed by almost anything. Does it need to be that liberal? Or are you really only looking for HINT followed by some non-whitespace? If you can change the `HINT.?` part of your regex to `HINT\S` you will get much faster times. Of course, that's just a guess on my part, but anything you can do to help the regex matcher limit its range of work, the better. xoxo, Andy	[reply]
Re^2: Very slow regex substitution on Unicode string by pbijnens (Novice) on May 26, 2011 at 06:58 UTC
There is not so much backtracking going on I think. I need to delete the whole xml-comment when it starts with the space+HINT. I think the fastest way is: `s/<!-- HINT.? -->//` [download] The ".?" should do minimal matching, i.e. stopping as soon as the string-so-far is followed by the litteral space-dash-dash-greaterthan. This (should) makes it do no backtracking at all, I believe. I do not think I can make it faster, or am I wrong? Also the removal of the initial BOM should not lead to much backtracking. The regex engine should look at most at one character ahead in this case (I think). `s/^\x{feff}*//` [download] And note that the whole regex was anchored at the beginning of the string too; again leading to much less work. I'll just wait for CentOS 6.x to come out, which has perl5.10, and until then just split up the regex. Good enough for me.	[reply] [d/l] [select]
Re^3: Very slow regex substitution on Unicode string by Anonymous Monk on May 26, 2011 at 15:11 UTC
You should get minimal backtracking. At the start of each word in the comment it will backtrack over the space between words. You should look into building your own up-to-date perl. I've found it to be remarkably easy on both Solaris & Linux boxes, and you get to use the latest & greatest perl for you own work. Just leave the system version where it is, and put your own on the #! line of your programs or put it earlier in your personal PATH. TJD	[reply]
Re: Very slow regex substitution on Unicode string by Anonymous Monk on May 25, 2011 at 19:54 UTC
I'm getting ... Is_utf8 My €0.02 of info Is_utf8 الأنشطة Benchmark: timing 2 iterations of ara nocmt 1step, ara nocmt 2step, ar +a w-cmt 1step, ara w-cmt 2step, lat nocmt 1step, lat nocmt 2step, lat + w-cmt 1step, lat w-cmt 2step... ara nocmt 1step: 0 wallclock secs ( 0.00 usr + 0.00 sys = 0.00 CPU) (warning: too few iterations for a reliable count) ara nocmt 2step: 0 wallclock secs ( 0.00 usr + 0.00 sys = 0.00 CPU) (warning: too few iterations for a reliable count) ara w-cmt 1step: 0 wallclock secs ( 0.00 usr + 0.00 sys = 0.00 CPU) (warning: too few iterations for a reliable count) ara w-cmt 2step: 0 wallclock secs ( 0.00 usr + 0.00 sys = 0.00 CPU) (warning: too few iterations for a reliable count) lat nocmt 1step: 0 wallclock secs ( 0.00 usr + 0.00 sys = 0.00 CPU) (warning: too few iterations for a reliable count) lat nocmt 2step: 0 wallclock secs ( 0.00 usr + 0.00 sys = 0.00 CPU) (warning: too few iterations for a reliable count) lat w-cmt 1step: 0 wallclock secs ( 0.00 usr + 0.00 sys = 0.00 CPU) (warning: too few iterations for a reliable count) lat w-cmt 2step: 0 wallclock secs ( 0.00 usr + 0.00 sys = 0.00 CPU) (warning: too few iterations for a reliable count) [download] ... on my slightly out of date 5.12 install. The fine folks who make this lovely little language of ours have done a lot of work on the regex & unicode stuff since in both 5.10 & 5.12. Perhaps you're being bitten by a bug related to your perl version? TJD	[reply] [d/l]
Re^2: Very slow regex substitution on Unicode string by Anonymous Monk on May 25, 2011 at 19:57 UTC
Bah, got in 2nd with an inferior reply. TJD	[reply]