pbijnens has asked for the wisdom of the Perl Monks concerning the following question:
I have this very strange problem.
In some cases this substitution is very slow (several seconds on a short string!):

    s{^\x{feff}*<!-- HINT.*? -->}{};

But the "some cases" seem very arbitrary, and I want to understand why.
Here is some code and below the benchmark results.

    #!/usr/bin/perl
    use strict;
    use warnings;
    use Encode;
    use Benchmark qw(:all);

    binmode(STDOUT, ":utf8");

    #my $xmlcmt = "<!-- HINT: Some XML comments to be deleted -->";
    my $xmlcmt = "";
    my $str_lat = "My \x{20AC}0.02 of info";
    my $str_ara = decode_utf8("\xd8\xa7\xd9\x84\xd8\xa3\xd9\x86\xd8\xb4\xd8\xb7\xd8\xa9");

    print Encode::is_utf8($str_lat) ? "Is_utf8" : "Not_utf8", "\t"; print $str_lat, "\n";
    print Encode::is_utf8($str_ara) ? "Is_utf8" : "Not_utf8", "\t"; print $str_ara, "\n";

    timethese(2, {
        "ara nocmt 1step" => sub { s1($str_ara) },
        "ara nocmt 2step" => sub { s2($str_ara) },
        "ara w-cmt 1step" => sub { s1($xmlcmt . $str_ara) },
        "ara w-cmt 2step" => sub { s2($xmlcmt . $str_ara) },
        "lat nocmt 1step" => sub { s1($str_lat) },
        "lat nocmt 2step" => sub { s2($str_lat) },
        "lat w-cmt 1step" => sub { s1($xmlcmt . $str_lat) },
        "lat w-cmt 2step" => sub { s2($xmlcmt . $str_lat) },
    });

    # Do the replacement in 1 step
    sub s1 {
        my $s = shift;
        #$s =~ s{^\x{feff}*<!-- .*? -->}{};      # This is fast
        #$s =~ s{^\x{feff}*<!-- Hi.*? -->}{};    # This is fast
        #$s =~ s{^\x{feff}?<!-- HINT.*? -->}{};  # This is fast
        $s =~ s{^\x{feff}*<!-- HINT.*? -->}{};   # This could be slow
    }

    # Do the replacement in 2 steps
    sub s2 {
        my $s = shift;
        $s =~ s{^\x{feff}*}{};
        $s =~ s{^<!-- HINT.*? -->}{};
    }

Here are the benchmarking results (on CentOS 5.6: "This is perl, v5.8.8 built for x86_64-linux-thread-multi"):

    ara nocmt 1step: 5 wallclock secs ( 4.72 usr + 0.04 sys = 4.76 CPU) @ 0.42/s (n=2)
    ara nocmt 2step: 0 wallclock secs ( 0.00 usr + 0.00 sys = 0.00 CPU)
    ara w-cmt 1step: 5 wallclock secs ( 4.75 usr + 0.00 sys = 4.75 CPU) @ 0.42/s (n=2)
    ara w-cmt 2step: 0 wallclock secs ( 0.00 usr + 0.00 sys = 0.00 CPU)
    lat nocmt 1step: 0 wallclock secs ( 0.00 usr + 0.00 sys = 0.00 CPU)
    lat nocmt 2step: 0 wallclock secs ( 0.00 usr + 0.00 sys = 0.00 CPU)
    lat w-cmt 1step: 0 wallclock secs ( 0.00 usr + 0.00 sys = 0.00 CPU)
    lat w-cmt 2step: 0 wallclock secs ( 0.00 usr + 0.00 sys = 0.00 CPU)
Notice how the first and third tests take almost 5 seconds for only 2 iterations.
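To see which pattern variant is responsible, it may help to time each one as a single call instead of through `timethese`. The following is only a minimal sketch: `time_subst` and the `%variants` labels are hypothetical names introduced here, and the patterns are passed as `qr//` objects, which Perl may optimize differently from the literal `s{...}{}` in the script, so the numbers are indicative at best.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(decode_utf8);
use Time::HiRes qw(gettimeofday tv_interval);

# Same Arabic test string as in the benchmark script.
my $str_ara = decode_utf8("\xd8\xa7\xd9\x84\xd8\xa3\xd9\x86\xd8\xb4\xd8\xb7\xd8\xa9");

# Hypothetical helper: run one substitution on a copy of the string
# and return the elapsed wall-clock time in seconds.
sub time_subst {
    my ($s, $re) = @_;
    my $t0 = [gettimeofday];
    $s =~ s/$re//;
    return tv_interval($t0);
}

# The four one-step patterns from s1 (labels are illustrative only).
my %variants = (
    'star, no word' => qr{^\x{feff}*<!-- .*? -->},
    'star, Hi'      => qr{^\x{feff}*<!-- Hi.*? -->},
    'optional BOM'  => qr{^\x{feff}?<!-- HINT.*? -->},
    'star, HINT'    => qr{^\x{feff}*<!-- HINT.*? -->},
);

for my $name (sort keys %variants) {
    printf "%-14s %.6f s\n", $name, time_subst($str_ara, $variants{$name});
}
```

If only the last variant stands out on the affected perl, that would narrow the problem to the combination of `\x{feff}*` and the longer literal, which matches what the comments in `s1` report.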
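When I change seemingly arbitrary conditions, the substitution is fast again: each commented-out variant in s1 is fast, as are the two-step version (s2) and every test on the Latin string.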
I would really like to understand what is going on here.
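One caveat about using the two-step version as a workaround: it is not strictly equivalent to the one-step pattern. The one-step pattern removes leading U+FEFF characters only when a HINT comment follows them, while the two-step version always removes them. A small sketch of the difference, where `strip_1step` and `strip_2step` are hypothetical names that return the modified string rather than the result of `s///`:

```perl
use strict;
use warnings;

# One-step version: BOMs are removed only together with a HINT comment.
sub strip_1step {
    my $s = shift;
    $s =~ s{^\x{feff}*<!-- HINT.*? -->}{};
    return $s;
}

# Two-step version: leading BOMs are always removed, then the comment.
sub strip_2step {
    my $s = shift;
    $s =~ s{^\x{feff}*}{};
    $s =~ s{^<!-- HINT.*? -->}{};
    return $s;
}

my $with_cmt = "\x{feff}<!-- HINT: x -->text";   # BOM followed by a comment
my $bom_only = "\x{feff}text";                   # BOM but no comment

print strip_1step($with_cmt) eq strip_2step($with_cmt) ? "same\n" : "different\n";
print strip_1step($bom_only) eq strip_2step($bom_only) ? "same\n" : "different\n";
```

This prints `same` for the first input and `different` for the second. Whether that matters depends on whether a leading BOM without a comment should be kept.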

Replies are listed 'Best First'.

Re: Very slow regex substitution on Unicode string
by tchrist (Pilgrim) on May 25, 2011 at 19:40 UTC

Re: Very slow regex substitution on Unicode string
by petdance (Parson) on May 26, 2011 at 04:23 UTC
by pbijnens (Novice) on May 26, 2011 at 06:58 UTC
by Anonymous Monk on May 26, 2011 at 15:11 UTC

Re: Very slow regex substitution on Unicode string
by Anonymous Monk on May 25, 2011 at 19:54 UTC
by Anonymous Monk on May 25, 2011 at 19:57 UTC