I have this very strange problem.

In some cases this substitution is very slow (several seconds on a short string!):

   s{^\x{feff}*<!-- HINT.*? -->}{};

Which cases are slow seems quite arbitrary, and I want to understand why.

Here is some test code, followed by the benchmark results.

#!/usr/bin/perl

use strict;
use warnings;

use Encode;
use Benchmark qw(:all);

binmode(STDOUT, ":utf8");

#my $xmlcmt = "<!-- HINT: Some XML comments to be deleted -->";
my $xmlcmt = ""; 
my $str_lat = "My \x{20AC}0.02 of info";
my $str_ara = decode_utf8("\xd8\xa7\xd9\x84\xd8\xa3\xd9\x86\xd8\xb4\xd8\xb7\xd8\xa9");


print Encode::is_utf8($str_lat) ? "Is_utf8" : "Not_utf8", "\t";  print $str_lat, "\n";
print Encode::is_utf8($str_ara) ? "Is_utf8" : "Not_utf8", "\t";  print $str_ara, "\n";

timethese(2, {
    "ara nocmt 1step" => sub { s1($str_ara) },
    "ara nocmt 2step" => sub { s2($str_ara) },
    "ara w-cmt 1step" => sub { s1($xmlcmt . $str_ara) },
    "ara w-cmt 2step" => sub { s2($xmlcmt . $str_ara) },
    "lat nocmt 1step" => sub { s1($str_lat) },
    "lat nocmt 2step" => sub { s2($str_lat) },
    "lat w-cmt 1step" => sub { s1($xmlcmt . $str_lat) },
    "lat w-cmt 2step" => sub { s2($xmlcmt . $str_lat) },
 
});

# Do the replacement in 1 step
sub s1
{
    my $s = shift;
    #$s =~ s{^\x{feff}*<!-- .*? -->}{};         # This is fast
    #$s =~ s{^\x{feff}*<!-- Hi.*? -->}{};       # This is fast
    #$s =~ s{^\x{feff}?<!-- HINT.*? -->}{};     # This is fast
    $s =~ s{^\x{feff}*<!-- HINT.*? -->}{};      # This could be slow
}

# Do the replacement in 2 steps
sub s2
{
    my $s = shift;
    $s =~ s{^\x{feff}*}{};
    $s =~ s{^<!-- HINT.*? -->}{};
}

Here are the benchmarking results (on CentOS 5.6: "This is perl, v5.8.8 built for x86_64-linux-thread-multi"):

ara nocmt 1step:  5 wallclock secs ( 4.72 usr +  0.04 sys =  4.76 CPU) @  0.42/s (n=2)
ara nocmt 2step:  0 wallclock secs ( 0.00 usr +  0.00 sys =  0.00 CPU)
ara w-cmt 1step:  5 wallclock secs ( 4.75 usr +  0.00 sys =  4.75 CPU) @  0.42/s (n=2)
ara w-cmt 2step:  0 wallclock secs ( 0.00 usr +  0.00 sys =  0.00 CPU)
lat nocmt 1step:  0 wallclock secs ( 0.00 usr +  0.00 sys =  0.00 CPU)
lat nocmt 2step:  0 wallclock secs ( 0.00 usr +  0.00 sys =  0.00 CPU)
lat w-cmt 1step:  0 wallclock secs ( 0.00 usr +  0.00 sys =  0.00 CPU)
lat w-cmt 2step:  0 wallclock secs ( 0.00 usr +  0.00 sys =  0.00 CPU)

Notice how the first and third tests take almost 5 seconds for only 2 iterations.
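Benchmark only reports totals here, so it does not show whether both iterations are equally slow. A minimal sketch of a per-call timing, assuming Time::HiRes is available (it ships with core Perl); `time_call` is a hypothetical helper name, and the string and patterns are copied from the benchmark above:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(decode_utf8);
use Time::HiRes qw(gettimeofday tv_interval);

# Same Arabic test string as in the benchmark code.
my $str_ara = decode_utf8("\xd8\xa7\xd9\x84\xd8\xa3\xd9\x86\xd8\xb4\xd8\xb7\xd8\xa9");

# time_call() is a hypothetical helper: it runs $code once and
# returns the elapsed wall-clock time in seconds.
sub time_call {
    my ($code) = @_;
    my $t0 = [gettimeofday];
    $code->();
    return tv_interval($t0);
}

# The two variants from the benchmark, each working on a copy.
my %variant = (
    '1step' => sub {
        my $s = $str_ara;
        $s =~ s{^\x{feff}*<!-- HINT.*? -->}{};
    },
    '2step' => sub {
        my $s = $str_ara;
        $s =~ s{^\x{feff}*}{};
        $s =~ s{^<!-- HINT.*? -->}{};
    },
);

# Time two separate calls per variant, to see whether the cost
# is paid on every call or only on the first one.
for my $name (sort keys %variant) {
    for my $run (1 .. 2) {
        printf "%-6s run %d: %.6f s\n", $name, $run, time_call($variant{$name});
    }
}
```

If only the first run of a variant is slow, that would point at a one-time setup cost rather than at the matching itself.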

When I change seemingly arbitrary conditions, the substitution is fast again:

Splitting the substitution into two steps makes it fast, as the benchmark code above shows.
Changing the "\x{feff}*" (zero-or-more) into "\x{feff}?" (zero-or-one) makes it fast.
Changing the word "HINT" into something else makes it fast again (lowercase "hint" is still slow).
Changing the Arabic source string can make it fast (some Arabic strings are fast, some are slow; I have not yet found a Latin-script string that is slow).
Similar to the previous item: actually adding something to delete, and thereby changing the string, makes it fast (use the commented-out $xmlcmt definition in the benchmark code instead of the empty one).
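One caveat about the two-step workaround: it is not strictly equivalent to the one-step pattern, because it strips leading BOMs even when no comment follows. A minimal sketch comparing the results, which also includes an atomic-group variant `(?>\x{feff}*)` as one more candidate. That variant is an assumption of mine, not something timed on perl 5.8.8, and the sub names are hypothetical:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# One-step pattern from the benchmark.
sub strip_1step {
    my $s = shift;
    $s =~ s{^\x{feff}*<!-- HINT.*? -->}{};
    return $s;
}

# Two-step version from the benchmark.
sub strip_2step {
    my $s = shift;
    $s =~ s{^\x{feff}*}{};
    $s =~ s{^<!-- HINT.*? -->}{};
    return $s;
}

# Hypothetical variant: the atomic group keeps the engine from
# giving back BOMs it has already matched.
sub strip_atomic {
    my $s = shift;
    $s =~ s{^(?>\x{feff}*)<!-- HINT.*? -->}{};
    return $s;
}

my $cmt    = "<!-- HINT: Some XML comments to be deleted -->";
my @inputs = ("\x{feff}${cmt}text", "${cmt}text", "\x{feff}text", "text");

# Compare each variant against the one-step result. Only the
# BOM-without-comment input gives a different 2step result.
for my $in (@inputs) {
    my $one    = strip_1step($in);
    my $two    = strip_2step($in);
    my $atomic = strip_atomic($in);
    printf "input length %2d: atomic %s 1step, 2step %s 1step\n",
        length($in),
        ($atomic eq $one ? 'same as' : 'differs from'),
        ($two    eq $one ? 'same as' : 'differs from');
}
```

Whether the atomic variant avoids the slowdown would still have to be measured on the affected machine.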

What exactly makes this slow?

Is there something wrong with the Arabic string? (I don't understand a single letter of it, I'm only the programmer handling the files.)
Is it the Unicode nature of the string or regex, or a bug in perl 5.8.8 on CentOS 5.6?
Is it reproducible on other CentOS 5.6 machines (not maintained by me :-) )?
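On the first question, a quick sanity check is to dump the code points of the decoded string. A minimal sketch, assuming the byte string in the benchmark code is the complete test input; `codepoints` is a hypothetical helper name:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(decode_utf8);

# Same Arabic test string as in the benchmark code.
my $str_ara = decode_utf8("\xd8\xa7\xd9\x84\xd8\xa3\xd9\x86\xd8\xb4\xd8\xb7\xd8\xa9");

# codepoints() is a hypothetical helper: it returns the characters
# of a string as space-separated U+XXXX code points.
sub codepoints {
    my ($s) = @_;
    return join ' ', map { sprintf 'U+%04X', ord } split //, $s;
}

print length($str_ara), " characters: ", codepoints($str_ara), "\n";
# prints: 7 characters: U+0627 U+0644 U+0623 U+0646 U+0634 U+0637 U+0629
```

All seven code points are ordinary letters in the Arabic block, with no BOM, control, or combining characters among them, so the string itself looks well-formed.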

I would really like to understand what is going on here.

In reply to Very slow regex substitution on Unicode string by pbijnens
