More efficient way to truncate long strings of the same character

kurtis has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
Re: More efficient way to truncate long strings of the same character by GrandFather (Saint) on Oct 30, 2008 at 20:49 UTC
Benchmarks are fun and very often not worth writing. In this case it turns out the OP's "slow" regex ain't so bad: The long: Rate ikegami2A mrm_l ikegami1 ikegami3 ikegami2x ikegami2 +JavaFan kvale ccn mrm_s orig GF ikegami2A 693/s -- -60% -66% -69% -76% -76% + -84% -84% -85% -89% -90% -90% mrm_l 1717/s 148% -- -15% -24% -40% -42% + -61% -62% -63% -72% -74% -75% ikegami1 2023/s 192% 18% -- -10% -29% -31% + -54% -55% -56% -67% -70% -71% ikegami3 2256/s 226% 31% 12% -- -21% -23% + -49% -49% -51% -63% -66% -67% ikegami2x 2856/s 312% 66% 41% 27% -- -3% + -36% -36% -38% -53% -57% -59% ikegami2 2940/s 324% 71% 45% 30% 3% -- + -34% -34% -36% -52% -56% -58% JavaFan 4443/s 542% 159% 120% 97% 56% 51% + -- -1% -3% -27% -34% -36% kvale 4467/s 545% 160% 121% 98% 56% 52% + 1% -- -2% -27% -33% -35% ccn 4581/s 562% 167% 127% 103% 60% 56% + 3% 3% -- -25% -32% -34% mrm_s 6087/s 779% 254% 201% 170% 113% 107% + 37% 36% 33% -- -9% -12% orig 6689/s 866% 290% 231% 196% 134% 128% + 51% 50% 46% 10% -- -3% GF 6919/s 899% 303% 242% 207% 142% 135% + 56% 55% 51% 14% 3% -- the short: Rate ikegami2x ikegami2A ikegami1 ikegami3 JavaFan mrm_l + kvale ccn ikegami2 mrm_s GF orig ikegami2x 54274/s -- -26% -64% -64% -71% -71% + -71% -75% -78% -78% -78% -84% ikegami2A 73135/s 35% -- -52% -52% -61% -61% + -61% -66% -70% -70% -71% -79% ikegami1 151690/s 179% 107% -- -0% -18% -19% + -19% -30% -38% -38% -39% -56% ikegami3 151690/s 179% 107% 0% -- -18% -19% + -19% -30% -38% -38% -39% -56% JavaFan 185391/s 242% 153% 22% 22% -- -1% + -2% -15% -24% -25% -26% -46% mrm_l 188182/s 247% 157% 24% 24% 2% -- + -0% -13% -22% -24% -25% -46% kvale 188359/s 247% 158% 24% 24% 2% 0% + -- -13% -22% -24% -25% -46% ccn 217080/s 300% 197% 43% 43% 17% 15% + 15% -- -11% -12% -13% -37% ikegami2 242794/s 347% 232% 60% 60% 31% 29% + 29% 12% -- -1% -3% -30% mrm_s 246469/s 354% 237% 62% 62% 33% 31% + 31% 14% 2% -- -1% -29% GF 249581/s 360% 241% 65% 65% 35% 33% + 33% 15% 3% 1% -- -28% orig 346085/s 538% 373% 128% 128% 87% 84% + 84% 59% 43% 40% 39% -- and the check: orig: 333444 ccn: 333444 kvale: 333444 JavaFan: 333444 GF: 333444 ikegami1: 3334444 ikegami2: 333444 ikegami2A: 333444 ikegami2x: 333444 ikegami3: 3334444 mrm_s_const: 3334444 mrm_l_const: 3334444 [download] Read more... The benchmark code (4 kB) The ikegami digit versions use 1 .. 9. The 2A version uses digits plus lower and upper case letters. The 2x variant generates the expanded match regex for each iteration to represent a build per match cost. Oh, and my regex? `s/((.)\1{2})\1+/$2/g`. Update: Fixed GF regex as noted by johngg. Added the fixed length variant of mr_mischief solutions. Added a sanity check for the routines (which are interesting!) and regenerated the output (Perl 5.8.8 on XP). Perl reduces RSI - it saves typing	[reply] [d/l] [select]
Re^2: More efficient way to truncate long strings of the same character by johngg (Canon) on Oct 30, 2008 at 21:39 UTC
`s/((.)\1{2})\1+/$2/g` I think you might have mixed up your '1's and '2's. `use strict; use warnings; my $str = q{aabbbbcddddddee}; print qq{Starting string\n $str\n}; # Your regex. $str =~ s/((.)\1{2})\1+/$2/g; print qq{Original regex\n $str\n}; # I think this is correct. $str = q{aabbbbcddddddee}; $str =~ s/((.)\2{2})\2+/$1/g; print qq{Corrected regex\n $str\n};` [download] Produces `Starting string aabbbbcddddddee Original regex aabbbbcddddddee Corrected regex aabbbcdddee` [download] This might affect the benchmark a little. Cheers, JohnGG	[reply] [d/l] [select]
Re^3: More efficient way to truncate long strings of the same character by mr_mischief (Monsignor) on Oct 30, 2008 at 23:37 UTC
It does change the benchmark for me somewhat. Under 5.10.0 on my 1GHz system, your fixed version of GrandFather's substitution actually performed better for the longer string 4 out of 5 runs of the benchmark. Interestingly enough, under 5.10.0 on this system, the original version posted is always one of the three fastest on the shorter string (and usually the fastest other than GrandFather's version). It's usually about 25% faster than the next. Under 5.8.8 on the same system, it's in a dead heat for the top for the longer string as well. 5.10.0 moves it back quite a bit in the rankings on the longer string, but it's still quite respectable in comparison. This reeks of premature optimization, but perhaps it isn't.	[reply]
Re^2: More efficient way to truncate long strings of the same character by mr_mischief (Monsignor) on Oct 30, 2008 at 21:51 UTC
I made a few tweaks to mine in the spirit of benchmarking. I also made some further tweaks to provide versions with magic number constants. Notice how little the extra flexibility costs, even with an extra level of subroutine calls for each iteration. Chances are most of you are running faster systems than I have at work. Read more... (16 kB) I also tested under 5.8.8 on the Linux box. The results for the most part weren't very much different, but certain solutions are helped or hindered more than others by the switch. Read more... (6 kB)	[reply] [d/l] [select]
Re^2: More efficient way to truncate long strings of the same character by johngg (Canon) on Oct 31, 2008 at 15:16 UTC
I've added a couple of new methods to your benchmark. I tried a method (`&johngg`) that used a global match to build a list of offsets and lengths of characters to discard and then used substr, processing the list in reverse order, i.e. from the end of the string back towards the front to avoid screwing up the offsets. This approach was a middling performer in the benchmarks. I then refined it (`&johngg2`) to work directly rather than building a list by matching against the reversed string. This seems to be a bit faster, placing third on long strings and coming out top with short ones. Read more... (5 kB) The benchmark output. The long: Rate ikegami2A ikegami1 mrm_l ikegami3 ikegami2x ikegami2 +johngg orig kvale JavaFan ccn johngg2 mrm_s GF ikegami2A 185/s -- -74% -75% -77% -79% -80% + -87% -91% -91% -91% -92% -93% -94% -94% ikegami1 717/s 286% -- -2% -10% -18% -21% + -48% -65% -67% -67% -69% -72% -76% -78% mrm_l 730/s 294% 2% -- -9% -17% -20% + -47% -65% -66% -66% -68% -71% -75% -78% ikegami3 800/s 331% 12% 10% -- -9% -12% + -42% -61% -63% -63% -65% -69% -73% -75% ikegami2x 878/s 373% 23% 20% 10% -- -4% + -36% -58% -59% -59% -62% -65% -70% -73% ikegami2 913/s 392% 27% 25% 14% 4% -- + -34% -56% -58% -58% -60% -64% -69% -72% johngg 1380/s 644% 93% 89% 72% 57% 51% + -- -33% -36% -36% -40% -46% -53% -58% orig 2069/s 1016% 189% 183% 159% 136% 127% + 50% -- -4% -4% -10% -19% -30% -36% kvale 2160/s 1064% 201% 196% 170% 146% 137% + 57% 4% -- -0% -6% -15% -27% -34% JavaFan 2160/s 1064% 201% 196% 170% 146% 137% + 57% 4% 0% -- -6% -15% -27% -34% ccn 2303/s 1141% 221% 215% 188% 162% 152% + 67% 11% 7% 7% -- -9% -22% -29% johngg2 2544/s 1272% 255% 248% 218% 190% 179% + 84% 23% 18% 18% 10% -- -14% -22% mrm_s 2950/s 1490% 312% 304% 269% 236% 223% + 114% 43% 37% 37% 28% 16% -- -9% GF 3251/s 1653% 354% 345% 306% 270% 256% + 136% 57% 51% 51% 41% 28% 10% -- the short: Rate ikegami2x ikegami2A ikegami1 ikegami3 ikegami2 Java +Fan kvale ccn johngg mrm_l GF mrm_s orig johngg2 ikegami2x 15890/s -- -12% -62% -67% -72% - +77% -77% -79% -80% -82% -82% -82% -86% -88% ikegami2A 18088/s 14% -- -57% -63% -68% - +74% -74% -76% -78% -79% -79% -80% -84% -87% ikegami1 42081/s 165% 133% -- -13% -25% - +40% -40% -45% -48% -51% -51% -53% -62% -69% ikegami3 48558/s 206% 168% 15% -- -13% - +30% -30% -37% -40% -43% -44% -46% -56% -64% ikegami2 56109/s 253% 210% 33% 16% -- - +19% -19% -27% -31% -35% -35% -37% -50% -58% JavaFan 69591/s 338% 285% 65% 43% 24% + -- -0% -9% -14% -19% -19% -22% -38% -48% kvale 69625/s 338% 285% 65% 43% 24% + 0% -- -9% -14% -19% -19% -22% -38% -48% ccn 76588/s 382% 323% 82% 58% 36% +10% 10% -- -5% -11% -11% -14% -31% -43% johngg 81031/s 410% 348% 93% 67% 44% +16% 16% 6% -- -6% -6% -9% -27% -40% mrm_l 85929/s 441% 375% 104% 77% 53% +23% 23% 12% 6% -- -0% -4% -23% -36% GF 86067/s 442% 376% 105% 77% 53% +24% 24% 12% 6% 0% -- -3% -23% -36% mrm_s 89154/s 461% 393% 112% 84% 59% +28% 28% 16% 10% 4% 4% -- -20% -34% orig 111505/s 602% 516% 165% 130% 99% +60% 60% 46% 38% 30% 30% 25% -- -17% johngg2 134990/s 750% 646% 221% 178% 141% +94% 94% 76% 67% 57% 57% 51% 21% -- and the check: orig: 333444 ccn: 333444 kvale: 333444 johngg: 333444 johngg2: 333444 JavaFan: 333444 GF: 333444 ikegami1: 3334444 ikegami2: 333444 ikegami2A: 333444 ikegami2x: 333444 ikegami3: 3334444 mrm_s_const: 3334444 mrm_l_const: 3334444 [download] I hope this is of interest. Cheers, JohnGG	[reply] [d/l] [select]
Re^3: More efficient way to truncate long strings of the same character by mr_mischief (Monsignor) on Oct 31, 2008 at 21:30 UTC
~~I think you need a sanity check on your sanity check. Where is $_ set? More importantly, where is it not set?~~ Update: You've included GrandFather's sanity check, but missed his error in it. See Re^2: More efficient way to truncate long strings of the same character for my remarks on that.	[reply]
Re^2: More efficient way to truncate long strings of the same character by mr_mischief (Monsignor) on Oct 31, 2008 at 21:35 UTC
The sanity checks would be more interesting if they were correct. You are assuming $_ is set and modified in place, which holds true only for a subset of the solutions. Try changing the order in which you run the benchmarks, and you'd see that $_ will be the same after my solutions run as before, no matter who else's ran immediately beforehand. Perhaps you should check actual return values of subs which return values.	[reply]
Re^3: More efficient way to truncate long strings of the same character by GrandFather (Saint) on Oct 31, 2008 at 22:15 UTC
Aye, there's the trouble with starting with a quick and dirty design then accreting bits on to it. :( The validation step is important as a check that the tested code is behaving as advertised, but can introduce extra benchmarked code that is unrelated to the code being tested. If I get time I'll revisit the code and fix it, but it may take a while before I get to it. Perl reduces RSI - it saves typing	[reply]
Re: More efficient way to truncate long strings of the same character by ikegami (Patriarch) on Oct 30, 2008 at 19:11 UTC
It's my understanding that back references (\1) are a bit slow. What follows are solutions that don't use back references. You'll have to benchmark them to see if they're faster. `#my @chars = grep !$seen{$_}++, $text =~ /./g; my @chars = 'a'..'z'; my ($re) = map qr/$_/, join '\|', map "(?<=${_}{3})$_+", map quotemeta, @chars; $text =~ s/$re//g;` [download] `#my @chars = grep !$seen{$_}++, $text =~ /./g; my @chars = 'a'..'z'; my ($re) = map qr/$_/, join '\|', map "${_}{4,}", map quotemeta, @chars; $text =~ s/($re)/substr($1,0,3)/eg;` [download] `#my @chars = grep !$seen{$_}++, $text =~ /./g; my @chars = 'a'..'z'; my ($re) = map qr/$_/, join '\|', map "${_}{4,}", map quotemeta, @chars; $text =~ s/$re/substr($text,$-[0],3)/eg;` [download] Update: Fixed bug identified in reply.	[reply] [d/l] [select]
Re^2: More efficient way to truncate long strings of the same character by GrandFather (Saint) on Oct 30, 2008 at 20:36 UTC
`map "(?<=$_{3})$_+",` should be `map "(?<=${_}{3})$_+",`. Perl reduces RSI - it saves typing	[reply] [d/l] [select]
Re^3: More efficient way to truncate long strings of the same character by ikegami (Patriarch) on Oct 30, 2008 at 23:41 UTC
It's not necessary. `>perl -e"$_='a'; print qr/$_{3}/" (?-xism:a{3})` [download] And beyond being unnecessary, it will never help either. If you were to do `${_}{...}` to prevent `$_{...}` from beint treated as a hash element, it still wouldn't do what you want. See Re: Of scalars, hashes, quantifiers, and regexen.	[reply] [d/l] [select]
Re^4: More efficient way to truncate long strings of the same character by GrandFather (Saint) on Oct 31, 2008 at 00:00 UTC
Re^5: More efficient way to truncate long strings of the same character by ikegami (Patriarch) on Oct 31, 2008 at 00:05 UTC
Re: More efficient way to truncate long strings of the same character by ccn (Vicar) on Oct 30, 2008 at 18:01 UTC
This one would be faster `$text =~ s/(.)\1{3,}/$1$1$1/gs;` [download] Updated: It is faster because it makes one substitution per char sequence. It replaces four or more repeated characters by three at one time.	[reply] [d/l]
Re: More efficient way to truncate long strings of the same character by mr_mischief (Monsignor) on Oct 30, 2008 at 20:38 UTC
TIMTOWTDI, so I'll give you two, both as subroutines. Reusable subroutines might help if you need to do this sort of thing often or if the length of the maximum run of identical characters changes from time to time. One is a generalization of what you've already been given and uses the substitution operator with an eval. It therefore uses the regular expression engine. The other uses a character count and substr instead. The former makes better use of language features in Perl. The latter is readily portable to language with weak or no support of regular expressions so long as they have decent string handling otherwise. I have no idea which is faster. A toy-sized solution like this really isn't worth benchmarking unless it's in a hotspot in a larger program. If backreferences are really a big thing to worry about, then the longer version might be worthwhile. The rest of the regex engine is pretty darn fast, though, so I imagine the substitution is faster whether or not backreferences are a speed issue. Contrasting the two solutions does show off the power of Perl's regular expression handling and the operators that work with it. The longer of these two subs only handles a specific case, but the substitution operator could do something entirely different with the change of just a few keystrokes. Behold, the power of chee... er, Perl. #!/usr/local/bin/perl use strict; use warnings; my $foo = 'abbcccddddeeeeeffffffggggggghhhhhhhhiiiiiiiii' . 'jjjjjjjjjj122333444455555666666777777788888888999999999!'; for my $longest ( 1 .. 5 ) { my $bar = max_run_1 ( $longest, $foo ); print 'max_run_1: ' . $bar . "\n"; $bar = max_run_2 ( $longest, $foo ); print 'max_run_2: ' . $bar . "\n"; } sub max_run_1 { my $max = $_[0]; my $len = length $_[1]; ### loop prelude my $char = substr $_[1], 0, 1; my $count = 1; my $new_string = $char; my $at = 1; while ( $at < $len ) { my $old_char = $char; $char = substr $_[1], $at++, 1; if ( $char eq $old_char ) { next if $count >= $max; $count++; } else { $count = 1; } $new_string .= $char; } return $new_string; } sub max_run_2 { my $max = $_[0]; ( my $new_string = $_[1] ) =~ s/(.)\1{$max,}/$1 x $max/eg; return $new_string; } [download]	[reply] [d/l]
Re: More efficient way to truncate long strings of the same character by BrowserUk (Patriarch) on Oct 30, 2008 at 18:38 UTC
If your strings are really quite long, and you want all characters in groups of 3, then using tr/// first is ~50 times quicker: `$x =~ tr[A-Za-z][]s; $x =~ s[(.)][$1 x 3]ge;` [download] Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error. "Science is about questioning the status quo. Questioning authority". In the absence of evidence, opinion is indistinguishable from prejudice. "Too many [] have been sedated by an oppressive environment of political correctness and risk aversion."	[reply] [d/l]
Re: More efficient way to truncate long strings of the same character by kvale (Monsignor) on Oct 30, 2008 at 18:22 UTC
`% perl -e'$x="aaaaaabbcccccd"; $x=~s/(.)\1{2,}/$1$1$1/g; print "$x\n"; +' aaabbcccd` [download] -Mark	[reply] [d/l]
Re^2: More efficient way to truncate long strings of the same character by JavaFan (Canon) on Oct 30, 2008 at 18:34 UTC
That would do needless work if a character is repeated only twice. You only have to modify the string is a character appears four or more times in a row. So, my suggestion is: `s/(.)\1{3,}/$1$1$1/g;` [download] And in 5.10, use of the \K operator may be faster (but I haven't checked it) `s/(.)\1\1\K\1+//g;` [download]	[reply] [d/l] [select]