This has taken a while because I encountered problems with benchmarking your solution. I probably would have given up at that point except that the problem itself is an interesting, in that it's indicative of a whole class of problems that require making global substitutions on a string, where the subsequent matches can overlap with previous replacements. The regex engine does not appear to have a good solution for this problem currently.
The OP's original two pass solution doesn't scale once the matched string moves beyond the end of the original replacement.
Taking the loop out of the regex as originally proposed by diotalevi works, but suffers from the need to start searching at the beginning of the string each time.
The next step is to try and remember where the last match started, and arrange to start the next iteration at that position +1, but as hv pointed out, the convenience /cg modifier combo that can be applied to m//, doesn't work with s/// as pos get reset after substitutions.
You, diotalevi and Enlil all came up with strategies for saving and restoring pos before / after substitution that work with varing degrees of success.
I tried a crude method of moving the substitution through the string using substr and a for loop, which -- despite forcing the test against every possible n-length substring, rather than allowing the regex engine to find the matches for itself -- stands up surprisingly well to benchmarking.
Its main advantage seems to be the simplicity. All the attempts to allow the regex engine to do the bulk of the searching (which it does so well in most cases) are offset by the need to make two or more function calls, do math and access additional variables.
So, should anyone else but me be interested in these esoteric findings, here is the benchmark and some results. I've haven't found a way around the trap that happens when your Lvalue ref solution is called more than twice, so I've had to specifically eliminate it from most of the tests. I've included the results from a couple of 1-pass runs on very long strings (one packed and one sparse) by way of comparison.
Every time I look at this, I still think that there is a simple solution that I am missing, but it has alluded me so far.
Results
D:\Perl\test>256024 -LEN=1000 -N=1 -CHECK s/iter dio1 ari1 dio2 enlil dio3 buk3 buk1 bu dio1 26.1 -- -46% -46% -95% -96% -98% -99% -9 ari1 14.2 84% -- 0% -90% -92% -97% -98% -9 dio2 14.2 84% 0% -- -90% -92% -97% -98% -9 enlil 1.41 1748% 906% 906% -- -20% -72% -77% -7 dio3 1.13 2205% 1155% 1155% 25% -- -65% -71% -7 buk3 0.401 6408% 3444% 3444% 252% 182% -- -17% -1 buk1 0.331 7785% 4193% 4193% 327% 242% 21% -- - buk2 0.330 7808% 4206% 4206% 328% 243% 22% 0% Ari1:a7aab2bba7aab2bba7aab2bba7aab2bba7aab2bba7aab2bba7aab ... buk1:a7aab2bba7aab2bba7aab2bba7aab2bba7aab2bba7aab2bba7aab ... buk2:a7aab2bba7aab2bba7aab2bba7aab2bba7aab2bba7aab2bba7aab ... buk3:a7aab2bba7aab2bba7aab2bba7aab2bba7aab2bba7aab2bba7aab ... dio1:a7aab2bba7aab2bba7aab2bba7aab2bba7aab2bba7aab2bba7aab ... dio2:a7aab2bba7aab2bba7aab2bba7aab2bba7aab2bba7aab2bba7aab ... dio3:a7aab2bba7aab2bba7aab2bba7aab2bba7aab2bba7aab2bba7aab ... enll:a7aab2bba7aab2bba7aab2bba7aab2bba7aab2bba7aab2bba7aab ... D:\Perl\test>256024 -LEN=1000 -N=1 -SPARSE=5 -CHECK s/iter dio1 dio2 ari1 enlil dio3 buk1 buk2 buk3 dio1 162 -- -44% -46% -93% -95% -100% -100% -100% dio2 90.7 79% -- -4% -88% -92% -99% -99% -99% ari1 87.5 85% 4% -- -88% -91% -99% -99% -99% enlil 10.6 1421% 752% 721% -- -28% -93% -93% -95% dio3 7.67 2012% 1082% 1040% 39% -- -90% -90% -94% buk1 0.791 20376% 11364% 10956% 1246% 870% -- -5% -39% buk2 0.752 21438% 11959% 11530% 1316% 920% 5% -- -36% buk3 0.480 33642% 18792% 18120% 2118% 1498% 65% 57% -- Ari1:a7aab2bb....____....____....____....____....____a7aab2bb....____. +... buk1:a7aab2bb....____....____....____....____....____a7aab2bb....____. +... buk2:a7aab2bb....____....____....____....____....____a7aab2bb....____. +... buk3:a7aab2bb....____....____....____....____....____a7aab2bb....____. +... dio1:a7aab2bb....____....____....____....____....____a7aab2bb....____. +... dio2:a7aab2bb....____....____....____....____....____a7aab2bb....____. +... dio3:a7aab2bb....____....____....____....____....____a7aab2bb....____. +... enll:a7aab2bb....____....____....____....____....____a7aab2bb....____. +... D:\Perl\test>256024 -LEN=1 -N=-3 Rate buk3 dio3 enlil dio2 buk2 buk1 dio1 buk3 2190/s -- -1% -3% -4% -10% -13% -38% dio3 2210/s 1% -- -2% -3% -9% -13% -37% enlil 2259/s 3% 2% -- -1% -7% -11% -36% dio2 2272/s 4% 3% 1% -- -6% -10% -35% buk2 2430/s 11% 10% 8% 7% -- -4% -31% buk1 2527/s 15% 14% 12% 11% 4% -- -28% dio1 3510/s 60% 59% 55% 54% 44% 39% -- D:\Perl\test>256024 -LEN=10 -N=-3 Rate dio2 dio1 enlil dio3 buk3 buk1 buk2 dio2 177/s -- -15% -20% -20% -28% -39% -40% dio1 209/s 18% -- -5% -5% -15% -28% -29% enlil 219/s 24% 5% -- -0% -10% -25% -26% dio3 220/s 25% 5% 0% -- -10% -25% -26% buk3 245/s 39% 17% 12% 11% -- -16% -17% buk1 292/s 65% 40% 33% 33% 19% -- -2% buk2 296/s 68% 42% 35% 35% 21% 2% -- D:\Perl\test>256024 -LEN=100 -N=-3 Rate dio1 dio2 dio3 enlil buk3 buk1 buk2 dio1 3.74/s -- -36% -84% -84% -85% -87% -87% dio2 5.87/s 57% -- -74% -75% -76% -80% -80% dio3 22.8/s 510% 289% -- -3% -6% -22% -23% enlil 23.4/s 526% 299% 3% -- -3% -20% -21% buk3 24.3/s 548% 313% 6% 3% -- -17% -18% buk1 29.2/s 680% 398% 28% 25% 21% -- -2% buk2 29.7/s 694% 406% 30% 27% 23% 2% --
Benchmark
#! perl -slw use strict; use vars qw[$LEN $N $SPARSE $CHECK]; use Benchmark qw[cmpthese]; our $str = '17341234'; $str .= '....____' x $SPARSE if $SPARSE; $str x= ($LEN || 5); our (@ari, @buk, @dio, @enlil); my $test = { ari1=> q[ $ari[1] = $str; local *_ = do{ \$ari[1] }; our $i = 0; *_ = \substr( $ari[1], $i ) while s[(.)(...)\1(?{ $i = pos() - 4 + })][a$2b]; ], buk1=> q[ $buk[1] = $str; substr($buk[1], $_, 5) =~ s[(.)(...)\1][a$2b] for 0 .. length ($ +buk[1]); ], buk2=> q[ $buk[2] = $str; for( 0 .. length($buk[2]) ) { substr($buk[2], $_, 5) =~ s[(.)(.. +.)\1][a$2b] }; ], buk3=> q[ $buk[3] = $str; my $i; $i = pos($buk[3])-5 , substr( $buk[3], $i, 5) =~ s[(.)(...)\1][a$2b] , pos($buk[3]) = $i while $buk[3] =~ m[(.)(...)\1]g; ], dio1=> q[ $dio[1] = $str; 1 while $dio[1] =~ s/(.)(...)\1/a$2b/; ], dio2=> q[ $dio[2] = $str; $r = 0; 1 while substr($dio[2],$r) =~ s/(.)(...)\1(?{$r = pos() - 3})/a$ +2b/; ], dio3=> q[ $dio[3] = $str; $r = 0; 1 while substr($dio[3],$r) =~ s[(.)(...)\1(?{$r +=pos() - 4})][a +$2b]; ], enlil=> q[{ $enlil[1] = $str; our $i = 0; pos($enlil[1]) = $i - 4 while $enlil[1] =~ s[\G (.*?) (.) (...) \2 (?{ $i = pos() }) +] [ defined $1 ? $1.'a'.$3.'b' : 'a'.$3.'b' +]ex; }], }; # A bug with Lvalue refs cause Ari1 to segfault if executed more than +twice! delete $test->{ari1}, $ari[1] = 'Test not performed' if $N < 1 or $ +N > 2; cmpthese( $N||-3, $test ); exit unless $CHECK; print " Ari1:$ari[1] buk1:$buk[1] buk2:$buk[2] buk3:$buk[3] dio1:$dio[1] dio2:$dio[2] dio3:$dio[3] enll:$enlil[1] "; __END__
In reply to Re: Re^2: Replace zero-width grouping?
by BrowserUk
in thread Replace zero-width grouping?
by tinypig
| For: | Use: | ||
| & | & | ||
| < | < | ||
| > | > | ||
| [ | [ | ||
| ] | ] |