in reply to Re^2: Replace zero-width grouping? (s/// against substr)
in thread Replace zero-width grouping?

This has taken a while because I encountered problems with benchmarking your solution. I probably would have given up at that point except that the problem itself is an interesting, in that it's indicative of a whole class of problems that require making global substitutions on a string, where the subsequent matches can overlap with previous replacements. The regex engine does not appear to have a good solution for this problem currently.

The OP's original two pass solution doesn't scale once the matched string moves beyond the end of the original replacement.

Taking the loop out of the regex as originally proposed by diotalevi works, but suffers from the need to start searching at the beginning of the string each time.

The next step is to try and remember where the last match started, and arrange to start the next iteration at that position +1, but as hv pointed out, the convenience /cg modifier combo that can be applied to m//, doesn't work with s/// as pos get reset after substitutions.

You, diotalevi and Enlil all came up with strategies for saving and restoring pos before / after substitution that work with varing degrees of success.

I tried a crude method of moving the substitution through the string using substr and a for loop, which -- despite forcing the test against every possible n-length substring, rather than allowing the regex engine to find the matches for itself -- stands up surprisingly well to benchmarking.

Its main advantage seems to be the simplicity. All the attempts to allow the regex engine to do the bulk of the searching (which it does so well in most cases) are offset by the need to make two or more function calls, do math and access additional variables.

So, should anyone else but me be interested in these esoteric findings, here is the benchmark and some results. I've haven't found a way around the trap that happens when your Lvalue ref solution is called more than twice, so I've had to specifically eliminate it from most of the tests. I've included the results from a couple of 1-pass runs on very long strings (one packed and one sparse) by way of comparison.

Every time I look at this, I still think that there is a simple solution that I am missing, but it has alluded me so far.

Results

D:\Perl\test>256024 -LEN=1000 -N=1 -CHECK s/iter dio1 ari1 dio2 enlil dio3 buk3 buk1 bu dio1 26.1 -- -46% -46% -95% -96% -98% -99% -9 ari1 14.2 84% -- 0% -90% -92% -97% -98% -9 dio2 14.2 84% 0% -- -90% -92% -97% -98% -9 enlil 1.41 1748% 906% 906% -- -20% -72% -77% -7 dio3 1.13 2205% 1155% 1155% 25% -- -65% -71% -7 buk3 0.401 6408% 3444% 3444% 252% 182% -- -17% -1 buk1 0.331 7785% 4193% 4193% 327% 242% 21% -- - buk2 0.330 7808% 4206% 4206% 328% 243% 22% 0% Ari1:a7aab2bba7aab2bba7aab2bba7aab2bba7aab2bba7aab2bba7aab ... buk1:a7aab2bba7aab2bba7aab2bba7aab2bba7aab2bba7aab2bba7aab ... buk2:a7aab2bba7aab2bba7aab2bba7aab2bba7aab2bba7aab2bba7aab ... buk3:a7aab2bba7aab2bba7aab2bba7aab2bba7aab2bba7aab2bba7aab ... dio1:a7aab2bba7aab2bba7aab2bba7aab2bba7aab2bba7aab2bba7aab ... dio2:a7aab2bba7aab2bba7aab2bba7aab2bba7aab2bba7aab2bba7aab ... dio3:a7aab2bba7aab2bba7aab2bba7aab2bba7aab2bba7aab2bba7aab ... enll:a7aab2bba7aab2bba7aab2bba7aab2bba7aab2bba7aab2bba7aab ... D:\Perl\test>256024 -LEN=1000 -N=1 -SPARSE=5 -CHECK s/iter dio1 dio2 ari1 enlil dio3 buk1 buk2 buk3 dio1 162 -- -44% -46% -93% -95% -100% -100% -100% dio2 90.7 79% -- -4% -88% -92% -99% -99% -99% ari1 87.5 85% 4% -- -88% -91% -99% -99% -99% enlil 10.6 1421% 752% 721% -- -28% -93% -93% -95% dio3 7.67 2012% 1082% 1040% 39% -- -90% -90% -94% buk1 0.791 20376% 11364% 10956% 1246% 870% -- -5% -39% buk2 0.752 21438% 11959% 11530% 1316% 920% 5% -- -36% buk3 0.480 33642% 18792% 18120% 2118% 1498% 65% 57% -- Ari1:a7aab2bb....____....____....____....____....____a7aab2bb....____. +... buk1:a7aab2bb....____....____....____....____....____a7aab2bb....____. +... buk2:a7aab2bb....____....____....____....____....____a7aab2bb....____. +... buk3:a7aab2bb....____....____....____....____....____a7aab2bb....____. +... dio1:a7aab2bb....____....____....____....____....____a7aab2bb....____. +... dio2:a7aab2bb....____....____....____....____....____a7aab2bb....____. +... dio3:a7aab2bb....____....____....____....____....____a7aab2bb....____. +... enll:a7aab2bb....____....____....____....____....____a7aab2bb....____. +... D:\Perl\test>256024 -LEN=1 -N=-3 Rate buk3 dio3 enlil dio2 buk2 buk1 dio1 buk3 2190/s -- -1% -3% -4% -10% -13% -38% dio3 2210/s 1% -- -2% -3% -9% -13% -37% enlil 2259/s 3% 2% -- -1% -7% -11% -36% dio2 2272/s 4% 3% 1% -- -6% -10% -35% buk2 2430/s 11% 10% 8% 7% -- -4% -31% buk1 2527/s 15% 14% 12% 11% 4% -- -28% dio1 3510/s 60% 59% 55% 54% 44% 39% -- D:\Perl\test>256024 -LEN=10 -N=-3 Rate dio2 dio1 enlil dio3 buk3 buk1 buk2 dio2 177/s -- -15% -20% -20% -28% -39% -40% dio1 209/s 18% -- -5% -5% -15% -28% -29% enlil 219/s 24% 5% -- -0% -10% -25% -26% dio3 220/s 25% 5% 0% -- -10% -25% -26% buk3 245/s 39% 17% 12% 11% -- -16% -17% buk1 292/s 65% 40% 33% 33% 19% -- -2% buk2 296/s 68% 42% 35% 35% 21% 2% -- D:\Perl\test>256024 -LEN=100 -N=-3 Rate dio1 dio2 dio3 enlil buk3 buk1 buk2 dio1 3.74/s -- -36% -84% -84% -85% -87% -87% dio2 5.87/s 57% -- -74% -75% -76% -80% -80% dio3 22.8/s 510% 289% -- -3% -6% -22% -23% enlil 23.4/s 526% 299% 3% -- -3% -20% -21% buk3 24.3/s 548% 313% 6% 3% -- -17% -18% buk1 29.2/s 680% 398% 28% 25% 21% -- -2% buk2 29.7/s 694% 406% 30% 27% 23% 2% --

Benchmark

#! perl -slw use strict; use vars qw[$LEN $N $SPARSE $CHECK]; use Benchmark qw[cmpthese]; our $str = '17341234'; $str .= '....____' x $SPARSE if $SPARSE; $str x= ($LEN || 5); our (@ari, @buk, @dio, @enlil); my $test = { ari1=> q[ $ari[1] = $str; local *_ = do{ \$ari[1] }; our $i = 0; *_ = \substr( $ari[1], $i ) while s[(.)(...)\1(?{ $i = pos() - 4 + })][a$2b]; ], buk1=> q[ $buk[1] = $str; substr($buk[1], $_, 5) =~ s[(.)(...)\1][a$2b] for 0 .. length ($ +buk[1]); ], buk2=> q[ $buk[2] = $str; for( 0 .. length($buk[2]) ) { substr($buk[2], $_, 5) =~ s[(.)(.. +.)\1][a$2b] }; ], buk3=> q[ $buk[3] = $str; my $i; $i = pos($buk[3])-5 , substr( $buk[3], $i, 5) =~ s[(.)(...)\1][a$2b] , pos($buk[3]) = $i while $buk[3] =~ m[(.)(...)\1]g; ], dio1=> q[ $dio[1] = $str; 1 while $dio[1] =~ s/(.)(...)\1/a$2b/; ], dio2=> q[ $dio[2] = $str; $r = 0; 1 while substr($dio[2],$r) =~ s/(.)(...)\1(?{$r = pos() - 3})/a$ +2b/; ], dio3=> q[ $dio[3] = $str; $r = 0; 1 while substr($dio[3],$r) =~ s[(.)(...)\1(?{$r +=pos() - 4})][a +$2b]; ], enlil=> q[{ $enlil[1] = $str; our $i = 0; pos($enlil[1]) = $i - 4 while $enlil[1] =~ s[\G (.*?) (.) (...) \2 (?{ $i = pos() }) +] [ defined $1 ? $1.'a'.$3.'b' : 'a'.$3.'b' +]ex; }], }; # A bug with Lvalue refs cause Ari1 to segfault if executed more than +twice! delete $test->{ari1}, $ari[1] = 'Test not performed' if $N < 1 or $ +N > 2; cmpthese( $N||-3, $test ); exit unless $CHECK; print " Ari1:$ari[1] buk1:$buk[1] buk2:$buk[2] buk3:$buk[3] dio1:$dio[1] dio2:$dio[2] dio3:$dio[3] enll:$enlil[1] "; __END__

Examine what is said, not who speaks.
"Efficiency is intelligent laziness." -David Dunham
"When I'm working on a problem, I never think about beauty. I think only how to solve the problem. But when I have finished, if the solution is not beautiful, I know it is wrong." -Richard Buckminster Fuller

Replies are listed 'Best First'.
Re^4: Replace zero-width grouping? (faster!)
by Aristotle (Chancellor) on May 09, 2003 at 06:22 UTC

    It's the localization of the glob that somehow messes things up (actually, it appears to happen when Perl tries to restore it upon exiting the block). It doesn't really matter though. The principle is exactly the same as your "dio3" code, just more convoluted.

    My intuition says the crucial bottleneck is the code assertion in the pattern, rather than the math. "buk3" beats "dio3" mostly, and the former certainly does more math than the latter.

    One thing that occured to me is that when you have 4 substitutions in a row, you can skip the next 4 characters, since they'll all have been replaced. I can't think of a simple way to account for this fact in code though. Without a working single pattern solution it would require a nested loop, which is most probably not worth the effort. And if we had a working single pattern solution, there'd not probably be any manual optimization that could beat just letting the regex engine do its job.

    Update: thinking about this gave me an idea.

    #!/usr/bin/perl -wl use strict; my @ari; my $str = "17341234173412341734123417341234"; $ari[2] = $str; substr($ari[2], $_*4, 8) =~ s[(.)(?=...\1)|(?<=(.)...)\2]{ defined $1 ? "A" : "B" }eg for 0 .. length($ari[2])/4; print $ari[2]; __END__ A7AAB2BBA7AAB2BBA7AAB2BBA7AAB2BB

    (Strange variable names left intact for easy addition to your benchmark.) It is consistently 4% faster than "buk3" on my machine.

    All that said and done, unless I had a real performance problem with this task, I'd use diotalevi's initial simpleminded approach in production. It works correctly and is far easier to read than any of the alternatives.

    Makeshifts last the longest.

      Nice one++. I like the step back you took. The OP's original 2 pass idea applied to a limited range that avoids its problem and Whamo! Fewer, bigger chunks and more of the work done by the regex engine. Neat. And In most of the variations I tried it was 30% to 36% quicker than the next fastest. The only time it looses out is on a very sparse string, but that's inevitable.

      As I believe was once said to Oscar Wilde, "I wish I'd thought of that" :)

      Your right on the code block assertion too, -- I wondered what the right term for that was -- it is the biggest hit on performance of those solutions.


      Examine what is said, not who speaks.
      "Efficiency is intelligent laziness." -David Dunham
      "When I'm working on a problem, I never think about beauty. I think only how to solve the problem. But when I have finished, if the solution is not beautiful, I know it is wrong." -Richard Buckminster Fuller

        I'm not sure "code block assertion" is the proper name, actually. (Maybe the Camel has something to say on the matter; perlre doesn't.) I just made up a term that made clear what I was talking about.

        Re "wish I'd thought of that": :) Remember the strategy for optimizing Perl code is to keep the execution of as much of an algorithm's logic as possible in the perl binary. The GRT is an impressive demonstration of this principle.

        Makeshifts last the longest.

Re: Re: Re^2: Replace zero-width grouping?
by hv (Prior) on May 09, 2003 at 11:36 UTC

    For the record, the lvalue-substr-localised-in-glob problem seems to be fixed in the latest development version of perl. Here are the results I get:

    zen% t0 -N=-1 -LEN=1 Rate ari1 enlil buk2 buk3 buk1 dio2 dio3 dio1 ari1 23424/s -- -8% -14% -15% -16% -20% -21% -36% enlil 25599/s 9% -- -6% -8% -8% -13% -13% -31% buk2 27306/s 17% 7% -- -1% -2% -7% -8% -26% buk3 27676/s 18% 8% 1% -- -1% -5% -6% -25% buk1 27926/s 19% 9% 2% 1% -- -5% -5% -24% dio2 29256/s 25% 14% 7% 6% 5% -- -1% -21% dio3 29537/s 26% 15% 8% 7% 6% 1% -- -20% dio1 36885/s 57% 44% 35% 33% 32% 26% 25% -- zen% t0 -N=-1 -LEN=10 Rate dio1 ari1 dio2 enlil buk2 buk3 buk1 dio3 dio1 1450/s -- -8% -21% -49% -50% -50% -51% -54% ari1 1569/s 8% -- -15% -44% -45% -46% -47% -50% dio2 1846/s 27% 18% -- -34% -36% -36% -38% -42% enlil 2817/s 94% 80% 53% -- -2% -3% -5% -11% buk2 2875/s 98% 83% 56% 2% -- -1% -3% -9% buk3 2901/s 100% 85% 57% 3% 1% -- -2% -8% buk1 2955/s 104% 88% 60% 5% 3% 2% -- -7% dio3 3169/s 118% 102% 72% 12% 10% 9% 7% -- zen% t0 -N=-1 -LEN=100 Rate dio1 ari1 dio2 enlil buk3 buk2 buk1 dio3 dio1 20.0/s -- -44% -45% -93% -93% -93% -93% -93% ari1 35.5/s 78% -- -3% -87% -88% -88% -88% -88% dio2 36.6/s 83% 3% -- -87% -87% -88% -88% -88% enlil 272/s 1262% 667% 644% -- -6% -7% -9% -11% buk3 290/s 1352% 718% 693% 7% -- -1% -3% -5% buk2 294/s 1368% 727% 702% 8% 1% -- -2% -4% buk1 299/s 1395% 742% 717% 10% 3% 2% -- -3% dio3 307/s 1437% 765% 740% 13% 6% 5% 3% -- zen% t0 -N=-1 -LEN=100 -SPARSE=5 Rate dio1 dio2 ari1 buk2 buk1 enlil dio3 buk3 dio1 3.48/s -- -48% -49% -96% -96% -98% -98% -98% dio2 6.73/s 94% -- -1% -92% -93% -96% -96% -97% ari1 6.80/s 95% 1% -- -92% -93% -96% -96% -97% buk2 89.1/s 2461% 1224% 1211% -- -4% -44% -50% -55% buk1 92.5/s 2558% 1274% 1260% 4% -- -42% -48% -54% enlil 160/s 4500% 2277% 2254% 80% 73% -- -10% -20% dio3 177/s 4993% 2532% 2507% 99% 92% 11% -- -11% buk3 199/s 5623% 2857% 2829% 123% 115% 24% 12% -- zen%

    Hugo