Re: Re^2: Replace zero-width grouping?

This has taken a while because I encountered problems with benchmarking your solution. I probably would have given up at that point except that the problem itself is an interesting, in that it's indicative of a whole class of problems that require making global substitutions on a string, where the subsequent matches can overlap with previous replacements. The regex engine does not appear to have a good solution for this problem currently.

The OP's original two pass solution doesn't scale once the matched string moves beyond the end of the original replacement.

Taking the loop out of the regex as originally proposed by diotalevi works, but suffers from the need to start searching at the beginning of the string each time.

The next step is to try and remember where the last match started, and arrange to start the next iteration at that position +1, but as hv pointed out, the convenience /cg modifier combo that can be applied to m//, doesn't work with s/// as pos get reset after substitutions.

You, diotalevi and Enlil all came up with strategies for saving and restoring pos before / after substitution that work with varing degrees of success.

I tried a crude method of moving the substitution through the string using substr and a for loop, which -- despite forcing the test against every possible n-length substring, rather than allowing the regex engine to find the matches for itself -- stands up surprisingly well to benchmarking.

Its main advantage seems to be the simplicity. All the attempts to allow the regex engine to do the bulk of the searching (which it does so well in most cases) are offset by the need to make two or more function calls, do math and access additional variables.

So, should anyone else but me be interested in these esoteric findings, here is the benchmark and some results. I've haven't found a way around the trap that happens when your Lvalue ref solution is called more than twice, so I've had to specifically eliminate it from most of the tests. I've included the results from a couple of 1-pass runs on very long strings (one packed and one sparse) by way of comparison.

Every time I look at this, I still think that there is a simple solution that I am missing, but it has alluded me so far.

Results

D:\Perl\test>256024 -LEN=1000 -N=1 -CHECK

      s/iter  dio1  ari1  dio2 enlil  dio3  buk3  buk1  bu
dio1    26.1    --  -46%  -46%  -95%  -96%  -98%  -99%  -9
ari1    14.2   84%    --    0%  -90%  -92%  -97%  -98%  -9
dio2    14.2   84%    0%    --  -90%  -92%  -97%  -98%  -9
enlil   1.41 1748%  906%  906%    --  -20%  -72%  -77%  -7
dio3    1.13 2205% 1155% 1155%   25%    --  -65%  -71%  -7
buk3   0.401 6408% 3444% 3444%  252%  182%    --  -17%  -1
buk1   0.331 7785% 4193% 4193%  327%  242%   21%    --   -
buk2   0.330 7808% 4206% 4206%  328%  243%   22%    0%

Ari1:a7aab2bba7aab2bba7aab2bba7aab2bba7aab2bba7aab2bba7aab ...
buk1:a7aab2bba7aab2bba7aab2bba7aab2bba7aab2bba7aab2bba7aab ...
buk2:a7aab2bba7aab2bba7aab2bba7aab2bba7aab2bba7aab2bba7aab ... 
buk3:a7aab2bba7aab2bba7aab2bba7aab2bba7aab2bba7aab2bba7aab ... 
dio1:a7aab2bba7aab2bba7aab2bba7aab2bba7aab2bba7aab2bba7aab ...
dio2:a7aab2bba7aab2bba7aab2bba7aab2bba7aab2bba7aab2bba7aab ...
dio3:a7aab2bba7aab2bba7aab2bba7aab2bba7aab2bba7aab2bba7aab ... 
enll:a7aab2bba7aab2bba7aab2bba7aab2bba7aab2bba7aab2bba7aab ...

D:\Perl\test>256024 -LEN=1000 -N=1 -SPARSE=5 -CHECK
      s/iter   dio1   dio2   ari1  enlil   dio3   buk1   buk2   buk3
dio1     162     --   -44%   -46%   -93%   -95%  -100%  -100%  -100%
dio2    90.7    79%     --    -4%   -88%   -92%   -99%   -99%   -99%
ari1    87.5    85%     4%     --   -88%   -91%   -99%   -99%   -99%
enlil   10.6  1421%   752%   721%     --   -28%   -93%   -93%   -95%
dio3    7.67  2012%  1082%  1040%    39%     --   -90%   -90%   -94%
buk1   0.791 20376% 11364% 10956%  1246%   870%     --    -5%   -39%
buk2   0.752 21438% 11959% 11530%  1316%   920%     5%     --   -36%
buk3   0.480 33642% 18792% 18120%  2118%  1498%    65%    57%     --

Ari1:a7aab2bb....____....____....____....____....____a7aab2bb....____.
+...
buk1:a7aab2bb....____....____....____....____....____a7aab2bb....____.
+...
buk2:a7aab2bb....____....____....____....____....____a7aab2bb....____.
+...
buk3:a7aab2bb....____....____....____....____....____a7aab2bb....____.
+...
dio1:a7aab2bb....____....____....____....____....____a7aab2bb....____.
+...
dio2:a7aab2bb....____....____....____....____....____a7aab2bb....____.
+...
dio3:a7aab2bb....____....____....____....____....____a7aab2bb....____.
+...
enll:a7aab2bb....____....____....____....____....____a7aab2bb....____.
+...

D:\Perl\test>256024 -LEN=1 -N=-3
        Rate  buk3  dio3 enlil  dio2  buk2  buk1  dio1
buk3  2190/s    --   -1%   -3%   -4%  -10%  -13%  -38%
dio3  2210/s    1%    --   -2%   -3%   -9%  -13%  -37%
enlil 2259/s    3%    2%    --   -1%   -7%  -11%  -36%
dio2  2272/s    4%    3%    1%    --   -6%  -10%  -35%
buk2  2430/s   11%   10%    8%    7%    --   -4%  -31%
buk1  2527/s   15%   14%   12%   11%    4%    --  -28%
dio1  3510/s   60%   59%   55%   54%   44%   39%    --

D:\Perl\test>256024 -LEN=10 -N=-3
       Rate  dio2  dio1 enlil  dio3  buk3  buk1  buk2
dio2  177/s    --  -15%  -20%  -20%  -28%  -39%  -40%
dio1  209/s   18%    --   -5%   -5%  -15%  -28%  -29%
enlil 219/s   24%    5%    --   -0%  -10%  -25%  -26%
dio3  220/s   25%    5%    0%    --  -10%  -25%  -26%
buk3  245/s   39%   17%   12%   11%    --  -16%  -17%
buk1  292/s   65%   40%   33%   33%   19%    --   -2%
buk2  296/s   68%   42%   35%   35%   21%    2%    --

D:\Perl\test>256024 -LEN=100 -N=-3
        Rate  dio1  dio2  dio3 enlil  buk3  buk1  buk2
dio1  3.74/s    --  -36%  -84%  -84%  -85%  -87%  -87%
dio2  5.87/s   57%    --  -74%  -75%  -76%  -80%  -80%
dio3  22.8/s  510%  289%    --   -3%   -6%  -22%  -23%
enlil 23.4/s  526%  299%    3%    --   -3%  -20%  -21%
buk3  24.3/s  548%  313%    6%    3%    --  -17%  -18%
buk1  29.2/s  680%  398%   28%   25%   21%    --   -2%
buk2  29.7/s  694%  406%   30%   27%   23%    2%    --
[download]

Benchmark

#! perl -slw
use strict;
use vars qw[$LEN $N $SPARSE $CHECK];
use Benchmark qw[cmpthese];

our $str = '17341234';
$str .= '....____' x $SPARSE if $SPARSE;
$str x= ($LEN || 5);

our (@ari, @buk, @dio, @enlil);

my $test = {
    ari1=> q[
      $ari[1] = $str;
      local *_ = do{ \$ari[1] };
      our $i = 0;
      *_ = \substr( $ari[1], $i ) while s[(.)(...)\1(?{ $i = pos() - 4
+ })][a$2b];
    ],
    buk1=> q[
      $buk[1] = $str;
      substr($buk[1], $_, 5) =~ s[(.)(...)\1][a$2b] for 0 .. length ($
+buk[1]);
    ],
    buk2=> q[
      $buk[2] = $str;
      for( 0 .. length($buk[2]) ) { substr($buk[2], $_, 5) =~ s[(.)(..
+.)\1][a$2b] };
    ],
    buk3=> q[
      $buk[3] =  $str;
      my $i;
      $i = pos($buk[3])-5
      , substr( $buk[3], $i, 5) =~ s[(.)(...)\1][a$2b]
      , pos($buk[3]) = $i
              while $buk[3] =~ m[(.)(...)\1]g;
    ],
    dio1=> q[
      $dio[1] = $str;
      1 while $dio[1] =~ s/(.)(...)\1/a$2b/;
    ],
    dio2=> q[
      $dio[2] = $str;
      $r = 0;
      1 while substr($dio[2],$r) =~ s/(.)(...)\1(?{$r = pos() - 3})/a$
+2b/;
    ],
    dio3=> q[
      $dio[3] = $str;
      $r = 0;
      1 while substr($dio[3],$r) =~ s[(.)(...)\1(?{$r +=pos() - 4})][a
+$2b];
    ],
    enlil=> q[{
      $enlil[1] = $str;
      our $i = 0;
      pos($enlil[1]) = $i - 4
        while $enlil[1] =~ s[\G (.*?) (.) (...) \2 (?{ $i = pos() })  
+]
                            [ defined $1 ? $1.'a'.$3.'b' : 'a'.$3.'b' 
+]ex;
    }],
};

# A bug with Lvalue refs cause Ari1 to segfault if executed more than 
+twice!
delete $test->{ari1}, $ari[1] = 'Test not performed'    if $N < 1 or $
+N > 2;

cmpthese( $N||-3, $test );

exit unless $CHECK;
print "
Ari1:$ari[1]
buk1:$buk[1]
buk2:$buk[2]
buk3:$buk[3]
dio1:$dio[1]
dio2:$dio[2]
dio3:$dio[3] 
enll:$enlil[1]
";

__END__
[download]

Examine what is said, not who speaks.

"Efficiency is intelligent laziness." -David Dunham
"When I'm working on a problem, I never think about beauty. I think only how to solve the problem. But when I have finished, if the solution is not beautiful, I know it is wrong." -Richard Buckminster Fuller

Comment on Re: Re^2: Replace zero-width grouping? Select or Download Code

Replies are listed 'Best First'.
Re^4: Replace zero-width grouping? (faster!) by Aristotle (Chancellor) on May 09, 2003 at 06:22 UTC
It's the localization of the glob that somehow messes things up (actually, it appears to happen when Perl tries to restore it upon exiting the block). It doesn't really matter though. The principle is exactly the same as your "dio3" code, just more convoluted. My intuition says the crucial bottleneck is the code assertion in the pattern, rather than the math. "buk3" beats "dio3" mostly, and the former certainly does more math than the latter. One thing that occured to me is that when you have 4 substitutions in a row, you can skip the next 4 characters, since they'll all have been replaced. I can't think of a simple way to account for this fact in code though. Without a working single pattern solution it would require a nested loop, which is most probably not worth the effort. And if we had a working single pattern solution, there'd not probably be any manual optimization that could beat just letting the regex engine do its job. Update: thinking about this gave me an idea. `#!/usr/bin/perl -wl use strict; my @ari; my $str = "17341234173412341734123417341234"; $ari[2] = $str; substr($ari[2], $_4, 8) =~ s[(.)(?=...\1)\|(?<=(.)...)\2]{ defined $1 ? "A" : "B" }eg for 0 .. length($ari[2])/4; print $ari[2]; __END__ A7AAB2BBA7AAB2BBA7AAB2BBA7AAB2BB` [download] (Strange variable names left intact for easy addition to your benchmark.) It is consistently 4% faster than "buk3" on my machine. All that said and done, unless I had a real performance problem with this task, I'd use diotalevi's initial simpleminded approach in production. It works correctly and is far easier to read than any of the alternatives. Makeshifts last the longest.*	[reply] [d/l]
Re: Re^4: Replace zero-width grouping? by BrowserUk (Patriarch) on May 09, 2003 at 07:24 UTC
Nice one++. I like the step back you took. The OP's original 2 pass idea applied to a limited range that avoids its problem and Whamo! Fewer, bigger chunks and more of the work done by the regex engine. Neat. And In most of the variations I tried it was 30% to 36% quicker than the next fastest. The only time it looses out is on a very sparse string, but that's inevitable. Read more... (2 kB) As I believe was once said to Oscar Wilde, "I wish I'd thought of that" :) Your right on the code block assertion too, -- I wondered what the right term for that was -- it is the biggest hit on performance of those solutions. Examine what is said, not who speaks. "Efficiency is intelligent laziness." -David Dunham "When I'm working on a problem, I never think about beauty. I think only how to solve the problem. But when I have finished, if the solution is not beautiful, I know it is wrong." -Richard Buckminster Fuller	[reply] [d/l]
Re^6: Replace zero-width grouping? (optimizing perl code) by Aristotle (Chancellor) on May 09, 2003 at 09:55 UTC
I'm not sure "code block assertion" is the proper name, actually. (Maybe the Camel has something to say on the matter; perlre doesn't.) I just made up a term that made clear what I was talking about. Re "wish I'd thought of that": `:)` Remember the strategy for optimizing Perl code is to keep the execution of as much of an algorithm's logic as possible in the `perl` binary. The GRT is an impressive demonstration of this principle. Makeshifts last the longest.	[reply]
Re: Re^6: Replace zero-width grouping? by diotalevi (Canon) on May 09, 2003 at 13:41 UTC
Re^8: Replace zero-width grouping? ( (??{ }) ?) by Aristotle (Chancellor) on May 09, 2003 at 18:58 UTC
Some notes below your chosen depth have not been shown here
Re: Re^6: Replace zero-width grouping? by BrowserUk (Patriarch) on May 09, 2003 at 15:37 UTC
Re^8: Replace zero-width grouping? by hv (Prior) on May 09, 2003 at 18:14 UTC
Some notes below your chosen depth have not been shown here
Re: Re: Re^2: Replace zero-width grouping? by hv (Prior) on May 09, 2003 at 11:36 UTC
For the record, the lvalue-substr-localised-in-glob problem seems to be fixed in the latest development version of perl. Here are the results I get: zen% t0 -N=-1 -LEN=1 Rate ari1 enlil buk2 buk3 buk1 dio2 dio3 dio1 ari1 23424/s -- -8% -14% -15% -16% -20% -21% -36% enlil 25599/s 9% -- -6% -8% -8% -13% -13% -31% buk2 27306/s 17% 7% -- -1% -2% -7% -8% -26% buk3 27676/s 18% 8% 1% -- -1% -5% -6% -25% buk1 27926/s 19% 9% 2% 1% -- -5% -5% -24% dio2 29256/s 25% 14% 7% 6% 5% -- -1% -21% dio3 29537/s 26% 15% 8% 7% 6% 1% -- -20% dio1 36885/s 57% 44% 35% 33% 32% 26% 25% -- zen% t0 -N=-1 -LEN=10 Rate dio1 ari1 dio2 enlil buk2 buk3 buk1 dio3 dio1 1450/s -- -8% -21% -49% -50% -50% -51% -54% ari1 1569/s 8% -- -15% -44% -45% -46% -47% -50% dio2 1846/s 27% 18% -- -34% -36% -36% -38% -42% enlil 2817/s 94% 80% 53% -- -2% -3% -5% -11% buk2 2875/s 98% 83% 56% 2% -- -1% -3% -9% buk3 2901/s 100% 85% 57% 3% 1% -- -2% -8% buk1 2955/s 104% 88% 60% 5% 3% 2% -- -7% dio3 3169/s 118% 102% 72% 12% 10% 9% 7% -- zen% t0 -N=-1 -LEN=100 Rate dio1 ari1 dio2 enlil buk3 buk2 buk1 dio3 dio1 20.0/s -- -44% -45% -93% -93% -93% -93% -93% ari1 35.5/s 78% -- -3% -87% -88% -88% -88% -88% dio2 36.6/s 83% 3% -- -87% -87% -88% -88% -88% enlil 272/s 1262% 667% 644% -- -6% -7% -9% -11% buk3 290/s 1352% 718% 693% 7% -- -1% -3% -5% buk2 294/s 1368% 727% 702% 8% 1% -- -2% -4% buk1 299/s 1395% 742% 717% 10% 3% 2% -- -3% dio3 307/s 1437% 765% 740% 13% 6% 5% 3% -- zen% t0 -N=-1 -LEN=100 -SPARSE=5 Rate dio1 dio2 ari1 buk2 buk1 enlil dio3 buk3 dio1 3.48/s -- -48% -49% -96% -96% -98% -98% -98% dio2 6.73/s 94% -- -1% -92% -93% -96% -96% -97% ari1 6.80/s 95% 1% -- -92% -93% -96% -96% -97% buk2 89.1/s 2461% 1224% 1211% -- -4% -44% -50% -55% buk1 92.5/s 2558% 1274% 1260% 4% -- -42% -48% -54% enlil 160/s 4500% 2277% 2254% 80% 73% -- -10% -20% dio3 177/s 4993% 2532% 2507% 99% 92% 11% -- -11% buk3 199/s 5623% 2857% 2829% 123% 115% 24% 12% -- zen% [download] Hugo	[reply] [d/l]