Re: More efficient way to truncate long strings of the same character

Benchmarks are fun and very often not worth writing. In this case it turns out the OP's "slow" regex ain't so bad:

The long:
            Rate ikegami2A mrm_l ikegami1 ikegami3 ikegami2x ikegami2 
+JavaFan kvale  ccn mrm_s orig   GF
ikegami2A  693/s        --  -60%     -66%     -69%      -76%     -76% 
+   -84%  -84% -85%  -89% -90% -90%
mrm_l     1717/s      148%    --     -15%     -24%      -40%     -42% 
+   -61%  -62% -63%  -72% -74% -75%
ikegami1  2023/s      192%   18%       --     -10%      -29%     -31% 
+   -54%  -55% -56%  -67% -70% -71%
ikegami3  2256/s      226%   31%      12%       --      -21%     -23% 
+   -49%  -49% -51%  -63% -66% -67%
ikegami2x 2856/s      312%   66%      41%      27%        --      -3% 
+   -36%  -36% -38%  -53% -57% -59%
ikegami2  2940/s      324%   71%      45%      30%        3%       -- 
+   -34%  -34% -36%  -52% -56% -58%
JavaFan   4443/s      542%  159%     120%      97%       56%      51% 
+     --   -1%  -3%  -27% -34% -36%
kvale     4467/s      545%  160%     121%      98%       56%      52% 
+     1%    --  -2%  -27% -33% -35%
ccn       4581/s      562%  167%     127%     103%       60%      56% 
+     3%    3%   --  -25% -32% -34%
mrm_s     6087/s      779%  254%     201%     170%      113%     107% 
+    37%   36%  33%    --  -9% -12%
orig      6689/s      866%  290%     231%     196%      134%     128% 
+    51%   50%  46%   10%   --  -3%
GF        6919/s      899%  303%     242%     207%      142%     135% 
+    56%   55%  51%   14%   3%   --
the short:
              Rate ikegami2x ikegami2A ikegami1 ikegami3 JavaFan mrm_l
+ kvale  ccn ikegami2 mrm_s   GF orig
ikegami2x  54274/s        --      -26%     -64%     -64%    -71%  -71%
+  -71% -75%     -78%  -78% -78% -84%
ikegami2A  73135/s       35%        --     -52%     -52%    -61%  -61%
+  -61% -66%     -70%  -70% -71% -79%
ikegami1  151690/s      179%      107%       --      -0%    -18%  -19%
+  -19% -30%     -38%  -38% -39% -56%
ikegami3  151690/s      179%      107%       0%       --    -18%  -19%
+  -19% -30%     -38%  -38% -39% -56%
JavaFan   185391/s      242%      153%      22%      22%      --   -1%
+   -2% -15%     -24%  -25% -26% -46%
mrm_l     188182/s      247%      157%      24%      24%      2%    --
+   -0% -13%     -22%  -24% -25% -46%
kvale     188359/s      247%      158%      24%      24%      2%    0%
+    -- -13%     -22%  -24% -25% -46%
ccn       217080/s      300%      197%      43%      43%     17%   15%
+   15%   --     -11%  -12% -13% -37%
ikegami2  242794/s      347%      232%      60%      60%     31%   29%
+   29%  12%       --   -1%  -3% -30%
mrm_s     246469/s      354%      237%      62%      62%     33%   31%
+   31%  14%       2%    --  -1% -29%
GF        249581/s      360%      241%      65%      65%     35%   33%
+   33%  15%       3%    1%   -- -28%
orig      346085/s      538%      373%     128%     128%     87%   84%
+   84%  59%      43%   40%  39%   --
and the check:
        orig: 333444
         ccn: 333444
       kvale: 333444
     JavaFan: 333444
          GF: 333444
    ikegami1: 3334444
    ikegami2: 333444
   ikegami2A: 333444
   ikegami2x: 333444
    ikegami3: 3334444
 mrm_s_const: 3334444
 mrm_l_const: 3334444
[download]

use warnings;
use strict;
use Benchmark qw(cmpthese);

my ($ikegamiRe1) = map qr/$_/, join '|', map "(?<=${_}{3})$_+", map qu
+otemeta,
    1 .. 9;
my ($ikegamiReA) = map qr/$_/, join '|', map "(?<=${_}{3})$_+", map qu
+otemeta,
    (1 .. 9, 'a' .. 'z', 'A' .. 'Z');
my $str = join '', map { $_ x $_ } 1 .. 9;
$str = $str x 20;

print "The long:\n";

cmpthese (
    -1,
    {
        orig    => \&orig,
        ccn     => \&ccn,
        kvale   => \&kvale,
        JavaFan => \&JavaFan,
        GF      => \&GF,
        ikegami1 => \&ikegami1,
        ikegami2 => \&ikegami2,
        ikegami2A => \&ikegami2A,
        ikegami2x => \&ikegami2x,
        ikegami3 => \&ikegami3,
        mrm_s => \&mrm_s_const,
        mrm_l => \&mrm_l_const,
    }
    );

$str = substr $str, 3, 7;
print "the short:\n";

cmpthese (
    -1,
    {
        orig    => \&orig,
        ccn     => \&ccn,
        kvale   => \&kvale,
        JavaFan => \&JavaFan,
        GF      => \&GF,
        ikegami1 => \&ikegami1,
        ikegami2 => \&ikegami2,
        ikegami2A => \&ikegami2A,
        ikegami2x => \&ikegami2x,
        ikegami3 => \&ikegami3,
        mrm_s => \&mrm_s_const,
        mrm_l => \&mrm_l_const,
    }
    );

print "and the check:\n";

for my $func (qw(
        orig ccn kvale JavaFan GF ikegami1 ikegami2 ikegami2A ikegami2
+x ikegami3
        mrm_s_const mrm_l_const
    )) {
    my $funcRef = \&{"$func"};

    &$funcRef;
    printf "%12s: %s\n", $func, $_;
}


sub orig {
    $_ = $str;
    s/(.)(?=\1\1\1)//gs;
}


sub ccn {
    $_ = $str;
    s/(.)\1{3,}/$1$1$1/gs;
}


sub kvale {
    $_ = $str;
    s/(.)\1{2,}/$1$1$1/g;
}


sub JavaFan {
    $_ = $str;
    s/(.)\1{2,}/$1$1$1/g;
}


sub GF {
    $_ = $str;
    s/((.)\2\2)\2+/$1/g;
}


sub ikegami1 {
    $_ = $str;
    s/$ikegamiRe1/substr($_,$-[0],3)/eg;
}


sub ikegami2 {
    $_ = $str;
    s/$ikegamiRe1//g;
}


sub ikegami2A {
    $_ = $str;
    s/$ikegamiReA//g;
}


sub ikegami2x {
    my ($re) = map qr/$_/, join '|', map "(?<=${_}{3})$_+", map quotem
+eta, 1 .. 9;

    $_ = $str;
    s/$re//g;
}


sub ikegami3 {
    $_ = $str;
    s/($ikegamiRe1)/substr($1,0,3)/eg;
}

sub mrm_long {
    my $len = length $_[1];

    ### loop prelude
    my $char = substr $_[1], 0, 1;
    my $count = 1;
    my $new_string = $char;
    my $at = 1;

    while ( $at < $len ) {
        my $old_char = $char;
        $char = substr $_[1], $at++, 1;

        if ( $char eq $old_char ) {
            next if $count >= $_[0];
            $count++;
        } else {
            $count = 1;
        }
        $new_string .= $char;
    }

    return $new_string;
}

sub mrm_short {
    ( my  $new_string = $_[1] ) =~ s/(.)\1{$_[0],}/$1 x $_[0]/eg;
    return $new_string;
}

sub mrm_s_const {
    ( my  $new_string = $str ) =~ s/(.)\1{3,}/$1 x 3/eg;
    return $new_string;
}

sub mrm_l_const {
    my $len = length $str;

    ### loop prelude
    my $char = substr $str, 0, 1;
    my $count = 1;
    my $new_string = $char;
    my $at = 1;

    while ( $at < $len ) {
        my $old_char = $char;
        $char = substr $str, $at++, 1;

        if ( $char eq $old_char ) {
            next if $count >= 3;
            $count++;
        } else {
            $count = 1;
        }
        $new_string .= $char;
    }

    return $new_string;
}
[download]

The ikegami digit versions use 1 .. 9. The 2A version uses digits plus lower and upper case letters. The 2x variant generates the expanded match regex for each iteration to represent a build per match cost.

Oh, and my regex? s/((.)\1{2})\1+/$2/g.

Update: Fixed GF regex as noted by johngg. Added the fixed length variant of mr_mischief solutions. Added a sanity check for the routines (which are interesting!) and regenerated the output (Perl 5.8.8 on XP).

Perl reduces RSI - it saves typing

Comment on Re: More efficient way to truncate long strings of the same character Select or Download Code

Replies are listed 'Best First'.
Re^2: More efficient way to truncate long strings of the same character by johngg (Canon) on Oct 30, 2008 at 21:39 UTC
`s/((.)\1{2})\1+/$2/g` I think you might have mixed up your '1's and '2's. `use strict; use warnings; my $str = q{aabbbbcddddddee}; print qq{Starting string\n $str\n}; # Your regex. $str =~ s/((.)\1{2})\1+/$2/g; print qq{Original regex\n $str\n}; # I think this is correct. $str = q{aabbbbcddddddee}; $str =~ s/((.)\2{2})\2+/$1/g; print qq{Corrected regex\n $str\n};` [download] Produces `Starting string aabbbbcddddddee Original regex aabbbbcddddddee Corrected regex aabbbcdddee` [download] This might affect the benchmark a little. Cheers, JohnGG	[reply] [d/l] [select]
Re^3: More efficient way to truncate long strings of the same character by mr_mischief (Monsignor) on Oct 30, 2008 at 23:37 UTC
It does change the benchmark for me somewhat. Under 5.10.0 on my 1GHz system, your fixed version of GrandFather's substitution actually performed better for the longer string 4 out of 5 runs of the benchmark. Interestingly enough, under 5.10.0 on this system, the original version posted is always one of the three fastest on the shorter string (and usually the fastest other than GrandFather's version). It's usually about 25% faster than the next. Under 5.8.8 on the same system, it's in a dead heat for the top for the longer string as well. 5.10.0 moves it back quite a bit in the rankings on the longer string, but it's still quite respectable in comparison. This reeks of premature optimization, but perhaps it isn't.	[reply]
Re^2: More efficient way to truncate long strings of the same character by mr_mischief (Monsignor) on Oct 30, 2008 at 21:51 UTC
I made a few tweaks to mine in the spirit of benchmarking. I also made some further tweaks to provide versions with magic number constants. Notice how little the extra flexibility costs, even with an extra level of subroutine calls for each iteration. Chances are most of you are running faster systems than I have at work. Read more... (16 kB) I also tested under 5.8.8 on the Linux box. The results for the most part weren't very much different, but certain solutions are helped or hindered more than others by the switch. Read more... (6 kB)	[reply] [d/l] [select]
Re^2: More efficient way to truncate long strings of the same character by johngg (Canon) on Oct 31, 2008 at 15:16 UTC
I've added a couple of new methods to your benchmark. I tried a method (`&johngg`) that used a global match to build a list of offsets and lengths of characters to discard and then used substr, processing the list in reverse order, i.e. from the end of the string back towards the front to avoid screwing up the offsets. This approach was a middling performer in the benchmarks. I then refined it (`&johngg2`) to work directly rather than building a list by matching against the reversed string. This seems to be a bit faster, placing third on long strings and coming out top with short ones. Read more... (5 kB) The benchmark output. The long: Rate ikegami2A ikegami1 mrm_l ikegami3 ikegami2x ikegami2 +johngg orig kvale JavaFan ccn johngg2 mrm_s GF ikegami2A 185/s -- -74% -75% -77% -79% -80% + -87% -91% -91% -91% -92% -93% -94% -94% ikegami1 717/s 286% -- -2% -10% -18% -21% + -48% -65% -67% -67% -69% -72% -76% -78% mrm_l 730/s 294% 2% -- -9% -17% -20% + -47% -65% -66% -66% -68% -71% -75% -78% ikegami3 800/s 331% 12% 10% -- -9% -12% + -42% -61% -63% -63% -65% -69% -73% -75% ikegami2x 878/s 373% 23% 20% 10% -- -4% + -36% -58% -59% -59% -62% -65% -70% -73% ikegami2 913/s 392% 27% 25% 14% 4% -- + -34% -56% -58% -58% -60% -64% -69% -72% johngg 1380/s 644% 93% 89% 72% 57% 51% + -- -33% -36% -36% -40% -46% -53% -58% orig 2069/s 1016% 189% 183% 159% 136% 127% + 50% -- -4% -4% -10% -19% -30% -36% kvale 2160/s 1064% 201% 196% 170% 146% 137% + 57% 4% -- -0% -6% -15% -27% -34% JavaFan 2160/s 1064% 201% 196% 170% 146% 137% + 57% 4% 0% -- -6% -15% -27% -34% ccn 2303/s 1141% 221% 215% 188% 162% 152% + 67% 11% 7% 7% -- -9% -22% -29% johngg2 2544/s 1272% 255% 248% 218% 190% 179% + 84% 23% 18% 18% 10% -- -14% -22% mrm_s 2950/s 1490% 312% 304% 269% 236% 223% + 114% 43% 37% 37% 28% 16% -- -9% GF 3251/s 1653% 354% 345% 306% 270% 256% + 136% 57% 51% 51% 41% 28% 10% -- the short: Rate ikegami2x ikegami2A ikegami1 ikegami3 ikegami2 Java +Fan kvale ccn johngg mrm_l GF mrm_s orig johngg2 ikegami2x 15890/s -- -12% -62% -67% -72% - +77% -77% -79% -80% -82% -82% -82% -86% -88% ikegami2A 18088/s 14% -- -57% -63% -68% - +74% -74% -76% -78% -79% -79% -80% -84% -87% ikegami1 42081/s 165% 133% -- -13% -25% - +40% -40% -45% -48% -51% -51% -53% -62% -69% ikegami3 48558/s 206% 168% 15% -- -13% - +30% -30% -37% -40% -43% -44% -46% -56% -64% ikegami2 56109/s 253% 210% 33% 16% -- - +19% -19% -27% -31% -35% -35% -37% -50% -58% JavaFan 69591/s 338% 285% 65% 43% 24% + -- -0% -9% -14% -19% -19% -22% -38% -48% kvale 69625/s 338% 285% 65% 43% 24% + 0% -- -9% -14% -19% -19% -22% -38% -48% ccn 76588/s 382% 323% 82% 58% 36% +10% 10% -- -5% -11% -11% -14% -31% -43% johngg 81031/s 410% 348% 93% 67% 44% +16% 16% 6% -- -6% -6% -9% -27% -40% mrm_l 85929/s 441% 375% 104% 77% 53% +23% 23% 12% 6% -- -0% -4% -23% -36% GF 86067/s 442% 376% 105% 77% 53% +24% 24% 12% 6% 0% -- -3% -23% -36% mrm_s 89154/s 461% 393% 112% 84% 59% +28% 28% 16% 10% 4% 4% -- -20% -34% orig 111505/s 602% 516% 165% 130% 99% +60% 60% 46% 38% 30% 30% 25% -- -17% johngg2 134990/s 750% 646% 221% 178% 141% +94% 94% 76% 67% 57% 57% 51% 21% -- and the check: orig: 333444 ccn: 333444 kvale: 333444 johngg: 333444 johngg2: 333444 JavaFan: 333444 GF: 333444 ikegami1: 3334444 ikegami2: 333444 ikegami2A: 333444 ikegami2x: 333444 ikegami3: 3334444 mrm_s_const: 3334444 mrm_l_const: 3334444 [download] I hope this is of interest. Cheers, JohnGG	[reply] [d/l] [select]
Re^3: More efficient way to truncate long strings of the same character by mr_mischief (Monsignor) on Oct 31, 2008 at 21:30 UTC
~~I think you need a sanity check on your sanity check. Where is $_ set? More importantly, where is it not set?~~ Update: You've included GrandFather's sanity check, but missed his error in it. See Re^2: More efficient way to truncate long strings of the same character for my remarks on that.	[reply]
Re^2: More efficient way to truncate long strings of the same character by mr_mischief (Monsignor) on Oct 31, 2008 at 21:35 UTC
The sanity checks would be more interesting if they were correct. You are assuming $_ is set and modified in place, which holds true only for a subset of the solutions. Try changing the order in which you run the benchmarks, and you'd see that $_ will be the same after my solutions run as before, no matter who else's ran immediately beforehand. Perhaps you should check actual return values of subs which return values.	[reply]
Re^3: More efficient way to truncate long strings of the same character by GrandFather (Saint) on Oct 31, 2008 at 22:15 UTC
Aye, there's the trouble with starting with a quick and dirty design then accreting bits on to it. :( The validation step is important as a check that the tested code is behaving as advertised, but can introduce extra benchmarked code that is unrelated to the code being tested. If I get time I'll revisit the code and fix it, but it may take a while before I get to it. Perl reduces RSI - it saves typing	[reply]