Re: More efficient way to truncate long strings of the same character
by GrandFather (Saint) on Oct 30, 2008 at 20:49 UTC
|
Benchmarks are fun and very often not worth writing. In this case it turns out the OP's "slow" regex ain't so bad:
The long:
Rate ikegami2A mrm_l ikegami1 ikegami3 ikegami2x ikegami2
+JavaFan kvale ccn mrm_s orig GF
ikegami2A 693/s -- -60% -66% -69% -76% -76%
+ -84% -84% -85% -89% -90% -90%
mrm_l 1717/s 148% -- -15% -24% -40% -42%
+ -61% -62% -63% -72% -74% -75%
ikegami1 2023/s 192% 18% -- -10% -29% -31%
+ -54% -55% -56% -67% -70% -71%
ikegami3 2256/s 226% 31% 12% -- -21% -23%
+ -49% -49% -51% -63% -66% -67%
ikegami2x 2856/s 312% 66% 41% 27% -- -3%
+ -36% -36% -38% -53% -57% -59%
ikegami2 2940/s 324% 71% 45% 30% 3% --
+ -34% -34% -36% -52% -56% -58%
JavaFan 4443/s 542% 159% 120% 97% 56% 51%
+ -- -1% -3% -27% -34% -36%
kvale 4467/s 545% 160% 121% 98% 56% 52%
+ 1% -- -2% -27% -33% -35%
ccn 4581/s 562% 167% 127% 103% 60% 56%
+ 3% 3% -- -25% -32% -34%
mrm_s 6087/s 779% 254% 201% 170% 113% 107%
+ 37% 36% 33% -- -9% -12%
orig 6689/s 866% 290% 231% 196% 134% 128%
+ 51% 50% 46% 10% -- -3%
GF 6919/s 899% 303% 242% 207% 142% 135%
+ 56% 55% 51% 14% 3% --
the short:
Rate ikegami2x ikegami2A ikegami1 ikegami3 JavaFan mrm_l
+ kvale ccn ikegami2 mrm_s GF orig
ikegami2x 54274/s -- -26% -64% -64% -71% -71%
+ -71% -75% -78% -78% -78% -84%
ikegami2A 73135/s 35% -- -52% -52% -61% -61%
+ -61% -66% -70% -70% -71% -79%
ikegami1 151690/s 179% 107% -- -0% -18% -19%
+ -19% -30% -38% -38% -39% -56%
ikegami3 151690/s 179% 107% 0% -- -18% -19%
+ -19% -30% -38% -38% -39% -56%
JavaFan 185391/s 242% 153% 22% 22% -- -1%
+ -2% -15% -24% -25% -26% -46%
mrm_l 188182/s 247% 157% 24% 24% 2% --
+ -0% -13% -22% -24% -25% -46%
kvale 188359/s 247% 158% 24% 24% 2% 0%
+ -- -13% -22% -24% -25% -46%
ccn 217080/s 300% 197% 43% 43% 17% 15%
+ 15% -- -11% -12% -13% -37%
ikegami2 242794/s 347% 232% 60% 60% 31% 29%
+ 29% 12% -- -1% -3% -30%
mrm_s 246469/s 354% 237% 62% 62% 33% 31%
+ 31% 14% 2% -- -1% -29%
GF 249581/s 360% 241% 65% 65% 35% 33%
+ 33% 15% 3% 1% -- -28%
orig 346085/s 538% 373% 128% 128% 87% 84%
+ 84% 59% 43% 40% 39% --
and the check:
orig: 333444
ccn: 333444
kvale: 333444
JavaFan: 333444
GF: 333444
ikegami1: 3334444
ikegami2: 333444
ikegami2A: 333444
ikegami2x: 333444
ikegami3: 3334444
mrm_s_const: 3334444
mrm_l_const: 3334444
The ikegami digit versions use 1 .. 9. The 2A version uses digits plus lower and upper case letters. The 2x variant generates the expanded match regex for each iteration to represent a build per match cost.
Oh, and my regex? s/((.)\1{2})\1+/$2/g.
Update: Fixed GF regex as noted by johngg. Added the fixed length variant of mr_mischief solutions. Added a sanity check for the routines (which are interesting!) and regenerated the output (Perl 5.8.8 on XP).
Perl reduces RSI - it saves typing
| [reply] [d/l] [select] |
|
|
use strict;
use warnings;
my $str = q{aabbbbcddddddee};
print qq{Starting string\n $str\n};
# Your regex.
$str =~ s/((.)\1{2})\1+/$2/g;
print qq{Original regex\n $str\n};
# I think this is correct.
$str = q{aabbbbcddddddee};
$str =~ s/((.)\2{2})\2+/$1/g;
print qq{Corrected regex\n $str\n};
Produces
Starting string
aabbbbcddddddee
Original regex
aabbbbcddddddee
Corrected regex
aabbbcdddee
This might affect the benchmark a little. Cheers, JohnGG | [reply] [d/l] [select] |
|
|
It does change the benchmark for me somewhat. Under 5.10.0 on my 1GHz system, your fixed version of GrandFather's substitution actually performed better for the longer string 4 out of 5 runs of the benchmark.
Interestingly enough, under 5.10.0 on this system, the original version posted is always one of the three fastest on the shorter string (and usually the fastest other than GrandFather's version). It's usually about 25% faster than the next. Under 5.8.8 on the same system, it's in a dead heat for the top for the longer string as well. 5.10.0 moves it back quite a bit in the rankings on the longer string, but it's still quite respectable in comparison.
This reeks of premature optimization, but perhaps it isn't.
| [reply] |
|
|
I made a few tweaks to mine in the spirit of benchmarking. I also made some further tweaks to provide versions with magic number constants. Notice how little the extra flexibility costs, even with an extra level of subroutine calls for each iteration. Chances are most of you are running faster systems than I have at work.
I also tested under 5.8.8 on the Linux box. The results for the most part weren't very much different, but certain solutions are helped or hindered more than others by the switch.
| [reply] [d/l] [select] |
|
|
I've added a couple of new methods to your benchmark.I tried a method (&johngg) that used a global match to build a list of offsets and lengths of characters to discard and then used substr, processing the list in reverse order, i.e. from the end of the string back towards the front to avoid screwing up the offsets. This approach was a middling performer in the benchmarks.
I then refined it (&johngg2) to work directly rather than building a list by matching against the reversed string. This seems to be a bit faster, placing third on long strings and coming out top with short ones.
The benchmark output.
The long:
Rate ikegami2A ikegami1 mrm_l ikegami3 ikegami2x ikegami2
+johngg orig kvale JavaFan ccn johngg2 mrm_s GF
ikegami2A 185/s -- -74% -75% -77% -79% -80%
+ -87% -91% -91% -91% -92% -93% -94% -94%
ikegami1 717/s 286% -- -2% -10% -18% -21%
+ -48% -65% -67% -67% -69% -72% -76% -78%
mrm_l 730/s 294% 2% -- -9% -17% -20%
+ -47% -65% -66% -66% -68% -71% -75% -78%
ikegami3 800/s 331% 12% 10% -- -9% -12%
+ -42% -61% -63% -63% -65% -69% -73% -75%
ikegami2x 878/s 373% 23% 20% 10% -- -4%
+ -36% -58% -59% -59% -62% -65% -70% -73%
ikegami2 913/s 392% 27% 25% 14% 4% --
+ -34% -56% -58% -58% -60% -64% -69% -72%
johngg 1380/s 644% 93% 89% 72% 57% 51%
+ -- -33% -36% -36% -40% -46% -53% -58%
orig 2069/s 1016% 189% 183% 159% 136% 127%
+ 50% -- -4% -4% -10% -19% -30% -36%
kvale 2160/s 1064% 201% 196% 170% 146% 137%
+ 57% 4% -- -0% -6% -15% -27% -34%
JavaFan 2160/s 1064% 201% 196% 170% 146% 137%
+ 57% 4% 0% -- -6% -15% -27% -34%
ccn 2303/s 1141% 221% 215% 188% 162% 152%
+ 67% 11% 7% 7% -- -9% -22% -29%
johngg2 2544/s 1272% 255% 248% 218% 190% 179%
+ 84% 23% 18% 18% 10% -- -14% -22%
mrm_s 2950/s 1490% 312% 304% 269% 236% 223%
+ 114% 43% 37% 37% 28% 16% -- -9%
GF 3251/s 1653% 354% 345% 306% 270% 256%
+ 136% 57% 51% 51% 41% 28% 10% --
the short:
Rate ikegami2x ikegami2A ikegami1 ikegami3 ikegami2 Java
+Fan kvale ccn johngg mrm_l GF mrm_s orig johngg2
ikegami2x 15890/s -- -12% -62% -67% -72% -
+77% -77% -79% -80% -82% -82% -82% -86% -88%
ikegami2A 18088/s 14% -- -57% -63% -68% -
+74% -74% -76% -78% -79% -79% -80% -84% -87%
ikegami1 42081/s 165% 133% -- -13% -25% -
+40% -40% -45% -48% -51% -51% -53% -62% -69%
ikegami3 48558/s 206% 168% 15% -- -13% -
+30% -30% -37% -40% -43% -44% -46% -56% -64%
ikegami2 56109/s 253% 210% 33% 16% -- -
+19% -19% -27% -31% -35% -35% -37% -50% -58%
JavaFan 69591/s 338% 285% 65% 43% 24%
+ -- -0% -9% -14% -19% -19% -22% -38% -48%
kvale 69625/s 338% 285% 65% 43% 24%
+ 0% -- -9% -14% -19% -19% -22% -38% -48%
ccn 76588/s 382% 323% 82% 58% 36%
+10% 10% -- -5% -11% -11% -14% -31% -43%
johngg 81031/s 410% 348% 93% 67% 44%
+16% 16% 6% -- -6% -6% -9% -27% -40%
mrm_l 85929/s 441% 375% 104% 77% 53%
+23% 23% 12% 6% -- -0% -4% -23% -36%
GF 86067/s 442% 376% 105% 77% 53%
+24% 24% 12% 6% 0% -- -3% -23% -36%
mrm_s 89154/s 461% 393% 112% 84% 59%
+28% 28% 16% 10% 4% 4% -- -20% -34%
orig 111505/s 602% 516% 165% 130% 99%
+60% 60% 46% 38% 30% 30% 25% -- -17%
johngg2 134990/s 750% 646% 221% 178% 141%
+94% 94% 76% 67% 57% 57% 51% 21% --
and the check:
orig: 333444
ccn: 333444
kvale: 333444
johngg: 333444
johngg2: 333444
JavaFan: 333444
GF: 333444
ikegami1: 3334444
ikegami2: 333444
ikegami2A: 333444
ikegami2x: 333444
ikegami3: 3334444
mrm_s_const: 3334444
mrm_l_const: 3334444
I hope this is of interest. Cheers, JohnGG | [reply] [d/l] [select] |
|
|
| [reply] |
|
|
The sanity checks would be more interesting if they were correct. You are assuming $_ is set and modified in place, which holds true only for a subset of the solutions.
Try changing the order in which you run the benchmarks, and you'd see that $_ will be the same after my solutions run as before, no matter who else's ran immediately beforehand. Perhaps you should check actual return values of subs which return values.
| [reply] |
|
|
Aye, there's the trouble with starting with a quick and dirty design then accreting bits on to it. :(
The validation step is important as a check that the tested code is behaving as advertised, but can introduce extra benchmarked code that is unrelated to the code being tested. If I get time I'll revisit the code and fix it, but it may take a while before I get to it.
Perl reduces RSI - it saves typing
| [reply] |
Re: More efficient way to truncate long strings of the same character
by ikegami (Patriarch) on Oct 30, 2008 at 19:11 UTC
|
It's my understanding that back references (\1) are a bit slow. What follows are solutions that don't use back references. You'll have to benchmark them to see if they're faster.
#my @chars = grep !$seen{$_}++, $text =~ /./g;
my @chars = 'a'..'z';
my ($re) = map qr/$_/,
join '|',
map "(?<=${_}{3})$_+",
map quotemeta,
@chars;
$text =~ s/$re//g;
#my @chars = grep !$seen{$_}++, $text =~ /./g;
my @chars = 'a'..'z';
my ($re) = map qr/$_/,
join '|',
map "${_}{4,}",
map quotemeta,
@chars;
$text =~ s/($re)/substr($1,0,3)/eg;
#my @chars = grep !$seen{$_}++, $text =~ /./g;
my @chars = 'a'..'z';
my ($re) = map qr/$_/,
join '|',
map "${_}{4,}",
map quotemeta,
@chars;
$text =~ s/$re/substr($text,$-[0],3)/eg;
Update: Fixed bug identified in reply. | [reply] [d/l] [select] |
|
|
| [reply] [d/l] [select] |
|
|
>perl -e"$_='a'; print qr/$_{3}/"
(?-xism:a{3})
And beyond being unnecessary, it will never help either. If you were to do ${_}{...} to prevent $_{...} from beint treated as a hash element, it still wouldn't do what you want. See Re: Of scalars, hashes, quantifiers, and regexen. | [reply] [d/l] [select] |
|
|
|
|
Re: More efficient way to truncate long strings of the same character
by ccn (Vicar) on Oct 30, 2008 at 18:01 UTC
|
$text =~ s/(.)\1{3,}/$1$1$1/gs;
Updated: It is faster because it makes one substitution per char sequence. It replaces four or more repeated characters by three at one time.
| [reply] [d/l] |
Re: More efficient way to truncate long strings of the same character
by mr_mischief (Monsignor) on Oct 30, 2008 at 20:38 UTC
|
TIMTOWTDI, so I'll give you two, both as subroutines. Reusable subroutines might help if you need to do this sort of thing often or if the length of the maximum run of identical characters changes from time to time.
One is a generalization of what you've already been given and uses the substitution operator with an eval. It therefore uses the regular expression engine. The other uses a character count and substr instead. The former makes better use of language features in Perl. The latter is readily portable to language with weak or no support of regular expressions so long as they have decent string handling otherwise.
I have no idea which is faster. A toy-sized solution like this really isn't worth benchmarking unless it's in a hotspot in a larger program. If backreferences are really a big thing to worry about, then the longer version might be worthwhile. The rest of the regex engine is pretty darn fast, though, so I imagine the substitution is faster whether or not backreferences are a speed issue.
Contrasting the two solutions does show off the power of Perl's regular expression handling and the operators that work with it. The longer of these two subs only handles a specific case, but the substitution operator could do something entirely different with the change of just a few keystrokes. Behold, the power of chee... er, Perl.
#!/usr/local/bin/perl
use strict;
use warnings;
my $foo = 'abbcccddddeeeeeffffffggggggghhhhhhhhiiiiiiiii'
. 'jjjjjjjjjj122333444455555666666777777788888888999999999!';
for my $longest ( 1 .. 5 ) {
my $bar = max_run_1 ( $longest, $foo );
print 'max_run_1: ' . $bar . "\n";
$bar = max_run_2 ( $longest, $foo );
print 'max_run_2: ' . $bar . "\n";
}
sub max_run_1 {
my $max = $_[0];
my $len = length $_[1];
### loop prelude
my $char = substr $_[1], 0, 1;
my $count = 1;
my $new_string = $char;
my $at = 1;
while ( $at < $len ) {
my $old_char = $char;
$char = substr $_[1], $at++, 1;
if ( $char eq $old_char ) {
next if $count >= $max;
$count++;
} else {
$count = 1;
}
$new_string .= $char;
}
return $new_string;
}
sub max_run_2 {
my $max = $_[0];
( my $new_string = $_[1] ) =~ s/(.)\1{$max,}/$1 x $max/eg;
return $new_string;
}
| [reply] [d/l] |
Re: More efficient way to truncate long strings of the same character
by BrowserUk (Patriarch) on Oct 30, 2008 at 18:38 UTC
|
If your strings are really quite long, and you want all characters in groups of 3, then using tr/// first is ~50 times quicker:
$x =~ tr[A-Za-z][]s;
$x =~ s[(.)][$1 x 3]ge;
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
| [reply] [d/l] |
Re: More efficient way to truncate long strings of the same character
by kvale (Monsignor) on Oct 30, 2008 at 18:22 UTC
|
% perl -e'$x="aaaaaabbcccccd"; $x=~s/(.)\1{2,}/$1$1$1/g; print "$x\n";
+'
aaabbcccd
| [reply] [d/l] |
|
|
That would do needless work if a character is repeated only twice. You only have to modify the string is a character appears four or more times in a row. So, my suggestion is:
s/(.)\1{3,}/$1$1$1/g;
And in 5.10, use of the \K operator may be faster (but I haven't checked it)
s/(.)\1\1\K\1+//g;
| [reply] [d/l] [select] |