in reply to More efficient way to truncate long strings of the same character

Benchmarks are fun and very often not worth writing. In this case it turns out the OP's "slow" regex ain't so bad:

The long: Rate ikegami2A mrm_l ikegami1 ikegami3 ikegami2x ikegami2 +JavaFan kvale ccn mrm_s orig GF ikegami2A 693/s -- -60% -66% -69% -76% -76% + -84% -84% -85% -89% -90% -90% mrm_l 1717/s 148% -- -15% -24% -40% -42% + -61% -62% -63% -72% -74% -75% ikegami1 2023/s 192% 18% -- -10% -29% -31% + -54% -55% -56% -67% -70% -71% ikegami3 2256/s 226% 31% 12% -- -21% -23% + -49% -49% -51% -63% -66% -67% ikegami2x 2856/s 312% 66% 41% 27% -- -3% + -36% -36% -38% -53% -57% -59% ikegami2 2940/s 324% 71% 45% 30% 3% -- + -34% -34% -36% -52% -56% -58% JavaFan 4443/s 542% 159% 120% 97% 56% 51% + -- -1% -3% -27% -34% -36% kvale 4467/s 545% 160% 121% 98% 56% 52% + 1% -- -2% -27% -33% -35% ccn 4581/s 562% 167% 127% 103% 60% 56% + 3% 3% -- -25% -32% -34% mrm_s 6087/s 779% 254% 201% 170% 113% 107% + 37% 36% 33% -- -9% -12% orig 6689/s 866% 290% 231% 196% 134% 128% + 51% 50% 46% 10% -- -3% GF 6919/s 899% 303% 242% 207% 142% 135% + 56% 55% 51% 14% 3% -- the short: Rate ikegami2x ikegami2A ikegami1 ikegami3 JavaFan mrm_l + kvale ccn ikegami2 mrm_s GF orig ikegami2x 54274/s -- -26% -64% -64% -71% -71% + -71% -75% -78% -78% -78% -84% ikegami2A 73135/s 35% -- -52% -52% -61% -61% + -61% -66% -70% -70% -71% -79% ikegami1 151690/s 179% 107% -- -0% -18% -19% + -19% -30% -38% -38% -39% -56% ikegami3 151690/s 179% 107% 0% -- -18% -19% + -19% -30% -38% -38% -39% -56% JavaFan 185391/s 242% 153% 22% 22% -- -1% + -2% -15% -24% -25% -26% -46% mrm_l 188182/s 247% 157% 24% 24% 2% -- + -0% -13% -22% -24% -25% -46% kvale 188359/s 247% 158% 24% 24% 2% 0% + -- -13% -22% -24% -25% -46% ccn 217080/s 300% 197% 43% 43% 17% 15% + 15% -- -11% -12% -13% -37% ikegami2 242794/s 347% 232% 60% 60% 31% 29% + 29% 12% -- -1% -3% -30% mrm_s 246469/s 354% 237% 62% 62% 33% 31% + 31% 14% 2% -- -1% -29% GF 249581/s 360% 241% 65% 65% 35% 33% + 33% 15% 3% 1% -- -28% orig 346085/s 538% 373% 128% 128% 87% 84% + 84% 59% 43% 40% 39% -- and the check: orig: 333444 ccn: 333444 kvale: 333444 JavaFan: 333444 GF: 333444 ikegami1: 3334444 ikegami2: 333444 ikegami2A: 333444 ikegami2x: 333444 ikegami3: 3334444 mrm_s_const: 3334444 mrm_l_const: 3334444
use warnings; use strict; use Benchmark qw(cmpthese); my ($ikegamiRe1) = map qr/$_/, join '|', map "(?<=${_}{3})$_+", map qu +otemeta, 1 .. 9; my ($ikegamiReA) = map qr/$_/, join '|', map "(?<=${_}{3})$_+", map qu +otemeta, (1 .. 9, 'a' .. 'z', 'A' .. 'Z'); my $str = join '', map { $_ x $_ } 1 .. 9; $str = $str x 20; print "The long:\n"; cmpthese ( -1, { orig => \&orig, ccn => \&ccn, kvale => \&kvale, JavaFan => \&JavaFan, GF => \&GF, ikegami1 => \&ikegami1, ikegami2 => \&ikegami2, ikegami2A => \&ikegami2A, ikegami2x => \&ikegami2x, ikegami3 => \&ikegami3, mrm_s => \&mrm_s_const, mrm_l => \&mrm_l_const, } ); $str = substr $str, 3, 7; print "the short:\n"; cmpthese ( -1, { orig => \&orig, ccn => \&ccn, kvale => \&kvale, JavaFan => \&JavaFan, GF => \&GF, ikegami1 => \&ikegami1, ikegami2 => \&ikegami2, ikegami2A => \&ikegami2A, ikegami2x => \&ikegami2x, ikegami3 => \&ikegami3, mrm_s => \&mrm_s_const, mrm_l => \&mrm_l_const, } ); print "and the check:\n"; for my $func (qw( orig ccn kvale JavaFan GF ikegami1 ikegami2 ikegami2A ikegami2 +x ikegami3 mrm_s_const mrm_l_const )) { my $funcRef = \&{"$func"}; &$funcRef; printf "%12s: %s\n", $func, $_; } sub orig { $_ = $str; s/(.)(?=\1\1\1)//gs; } sub ccn { $_ = $str; s/(.)\1{3,}/$1$1$1/gs; } sub kvale { $_ = $str; s/(.)\1{2,}/$1$1$1/g; } sub JavaFan { $_ = $str; s/(.)\1{2,}/$1$1$1/g; } sub GF { $_ = $str; s/((.)\2\2)\2+/$1/g; } sub ikegami1 { $_ = $str; s/$ikegamiRe1/substr($_,$-[0],3)/eg; } sub ikegami2 { $_ = $str; s/$ikegamiRe1//g; } sub ikegami2A { $_ = $str; s/$ikegamiReA//g; } sub ikegami2x { my ($re) = map qr/$_/, join '|', map "(?<=${_}{3})$_+", map quotem +eta, 1 .. 9; $_ = $str; s/$re//g; } sub ikegami3 { $_ = $str; s/($ikegamiRe1)/substr($1,0,3)/eg; } sub mrm_long { my $len = length $_[1]; ### loop prelude my $char = substr $_[1], 0, 1; my $count = 1; my $new_string = $char; my $at = 1; while ( $at < $len ) { my $old_char = $char; $char = substr $_[1], $at++, 1; if ( $char eq $old_char ) { next if $count >= $_[0]; $count++; } else { $count = 1; } $new_string .= $char; } return $new_string; } sub mrm_short { ( my $new_string = $_[1] ) =~ s/(.)\1{$_[0],}/$1 x $_[0]/eg; return $new_string; } sub mrm_s_const { ( my $new_string = $str ) =~ s/(.)\1{3,}/$1 x 3/eg; return $new_string; } sub mrm_l_const { my $len = length $str; ### loop prelude my $char = substr $str, 0, 1; my $count = 1; my $new_string = $char; my $at = 1; while ( $at < $len ) { my $old_char = $char; $char = substr $str, $at++, 1; if ( $char eq $old_char ) { next if $count >= 3; $count++; } else { $count = 1; } $new_string .= $char; } return $new_string; }

The ikegami digit versions use 1 .. 9. The 2A version uses digits plus lower and upper case letters. The 2x variant generates the expanded match regex for each iteration to represent a build per match cost.

Oh, and my regex? s/((.)\1{2})\1+/$2/g.

Update: Fixed GF regex as noted by johngg. Added the fixed length variant of mr_mischief solutions. Added a sanity check for the routines (which are interesting!) and regenerated the output (Perl 5.8.8 on XP).


Perl reduces RSI - it saves typing

Replies are listed 'Best First'.
Re^2: More efficient way to truncate long strings of the same character
by johngg (Canon) on Oct 30, 2008 at 21:39 UTC
    s/((.)\1{2})\1+/$2/g

    I think you might have mixed up your '1's and '2's.

    use strict; use warnings; my $str = q{aabbbbcddddddee}; print qq{Starting string\n $str\n}; # Your regex. $str =~ s/((.)\1{2})\1+/$2/g; print qq{Original regex\n $str\n}; # I think this is correct. $str = q{aabbbbcddddddee}; $str =~ s/((.)\2{2})\2+/$1/g; print qq{Corrected regex\n $str\n};

    Produces

    Starting string aabbbbcddddddee Original regex aabbbbcddddddee Corrected regex aabbbcdddee

    This might affect the benchmark a little.

    Cheers,

    JohnGG

      It does change the benchmark for me somewhat. Under 5.10.0 on my 1GHz system, your fixed version of GrandFather's substitution actually performed better for the longer string 4 out of 5 runs of the benchmark.

      Interestingly enough, under 5.10.0 on this system, the original version posted is always one of the three fastest on the shorter string (and usually the fastest other than GrandFather's version). It's usually about 25% faster than the next. Under 5.8.8 on the same system, it's in a dead heat for the top for the longer string as well. 5.10.0 moves it back quite a bit in the rankings on the longer string, but it's still quite respectable in comparison.

      This reeks of premature optimization, but perhaps it isn't.

Re^2: More efficient way to truncate long strings of the same character
by mr_mischief (Monsignor) on Oct 30, 2008 at 21:51 UTC
    I made a few tweaks to mine in the spirit of benchmarking. I also made some further tweaks to provide versions with magic number constants. Notice how little the extra flexibility costs, even with an extra level of subroutine calls for each iteration. Chances are most of you are running faster systems than I have at work. I also tested under 5.8.8 on the Linux box. The results for the most part weren't very much different, but certain solutions are helped or hindered more than others by the switch.
Re^2: More efficient way to truncate long strings of the same character
by johngg (Canon) on Oct 31, 2008 at 15:16 UTC
    I've added a couple of new methods to your benchmark.

    I tried a method (&johngg) that used a global match to build a list of offsets and lengths of characters to discard and then used substr, processing the list in reverse order, i.e. from the end of the string back towards the front to avoid screwing up the offsets. This approach was a middling performer in the benchmarks.

    I then refined it (&johngg2) to work directly rather than building a list by matching against the reversed string. This seems to be a bit faster, placing third on long strings and coming out top with short ones.

    The benchmark output.

    The long: Rate ikegami2A ikegami1 mrm_l ikegami3 ikegami2x ikegami2 +johngg orig kvale JavaFan ccn johngg2 mrm_s GF ikegami2A 185/s -- -74% -75% -77% -79% -80% + -87% -91% -91% -91% -92% -93% -94% -94% ikegami1 717/s 286% -- -2% -10% -18% -21% + -48% -65% -67% -67% -69% -72% -76% -78% mrm_l 730/s 294% 2% -- -9% -17% -20% + -47% -65% -66% -66% -68% -71% -75% -78% ikegami3 800/s 331% 12% 10% -- -9% -12% + -42% -61% -63% -63% -65% -69% -73% -75% ikegami2x 878/s 373% 23% 20% 10% -- -4% + -36% -58% -59% -59% -62% -65% -70% -73% ikegami2 913/s 392% 27% 25% 14% 4% -- + -34% -56% -58% -58% -60% -64% -69% -72% johngg 1380/s 644% 93% 89% 72% 57% 51% + -- -33% -36% -36% -40% -46% -53% -58% orig 2069/s 1016% 189% 183% 159% 136% 127% + 50% -- -4% -4% -10% -19% -30% -36% kvale 2160/s 1064% 201% 196% 170% 146% 137% + 57% 4% -- -0% -6% -15% -27% -34% JavaFan 2160/s 1064% 201% 196% 170% 146% 137% + 57% 4% 0% -- -6% -15% -27% -34% ccn 2303/s 1141% 221% 215% 188% 162% 152% + 67% 11% 7% 7% -- -9% -22% -29% johngg2 2544/s 1272% 255% 248% 218% 190% 179% + 84% 23% 18% 18% 10% -- -14% -22% mrm_s 2950/s 1490% 312% 304% 269% 236% 223% + 114% 43% 37% 37% 28% 16% -- -9% GF 3251/s 1653% 354% 345% 306% 270% 256% + 136% 57% 51% 51% 41% 28% 10% -- the short: Rate ikegami2x ikegami2A ikegami1 ikegami3 ikegami2 Java +Fan kvale ccn johngg mrm_l GF mrm_s orig johngg2 ikegami2x 15890/s -- -12% -62% -67% -72% - +77% -77% -79% -80% -82% -82% -82% -86% -88% ikegami2A 18088/s 14% -- -57% -63% -68% - +74% -74% -76% -78% -79% -79% -80% -84% -87% ikegami1 42081/s 165% 133% -- -13% -25% - +40% -40% -45% -48% -51% -51% -53% -62% -69% ikegami3 48558/s 206% 168% 15% -- -13% - +30% -30% -37% -40% -43% -44% -46% -56% -64% ikegami2 56109/s 253% 210% 33% 16% -- - +19% -19% -27% -31% -35% -35% -37% -50% -58% JavaFan 69591/s 338% 285% 65% 43% 24% + -- -0% -9% -14% -19% -19% -22% -38% -48% kvale 69625/s 338% 285% 65% 43% 24% + 0% -- -9% -14% -19% -19% -22% -38% -48% ccn 76588/s 382% 323% 82% 58% 36% +10% 10% -- -5% -11% -11% -14% -31% -43% johngg 81031/s 410% 348% 93% 67% 44% +16% 16% 6% -- -6% -6% -9% -27% -40% mrm_l 85929/s 441% 375% 104% 77% 53% +23% 23% 12% 6% -- -0% -4% -23% -36% GF 86067/s 442% 376% 105% 77% 53% +24% 24% 12% 6% 0% -- -3% -23% -36% mrm_s 89154/s 461% 393% 112% 84% 59% +28% 28% 16% 10% 4% 4% -- -20% -34% orig 111505/s 602% 516% 165% 130% 99% +60% 60% 46% 38% 30% 30% 25% -- -17% johngg2 134990/s 750% 646% 221% 178% 141% +94% 94% 76% 67% 57% 57% 51% 21% -- and the check: orig: 333444 ccn: 333444 kvale: 333444 johngg: 333444 johngg2: 333444 JavaFan: 333444 GF: 333444 ikegami1: 3334444 ikegami2: 333444 ikegami2A: 333444 ikegami2x: 333444 ikegami3: 3334444 mrm_s_const: 3334444 mrm_l_const: 3334444

    I hope this is of interest.

    Cheers,

    JohnGG

Re^2: More efficient way to truncate long strings of the same character
by mr_mischief (Monsignor) on Oct 31, 2008 at 21:35 UTC
    The sanity checks would be more interesting if they were correct. You are assuming $_ is set and modified in place, which holds true only for a subset of the solutions.

    Try changing the order in which you run the benchmarks, and you'd see that $_ will be the same after my solutions run as before, no matter who else's ran immediately beforehand. Perhaps you should check actual return values of subs which return values.

      Aye, there's the trouble with starting with a quick and dirty design then accreting bits on to it. :(

      The validation step is important as a check that the tested code is behaving as advertised, but can introduce extra benchmarked code that is unrelated to the code being tested. If I get time I'll revisit the code and fix it, but it may take a while before I get to it.


      Perl reduces RSI - it saves typing