comment on

This has taken a while because I encountered problems with benchmarking your solution. I probably would have given up at that point except that the problem itself is an interesting, in that it's indicative of a whole class of problems that require making global substitutions on a string, where the subsequent matches can overlap with previous replacements. The regex engine does not appear to have a good solution for this problem currently.

The OP's original two pass solution doesn't scale once the matched string moves beyond the end of the original replacement.

Taking the loop out of the regex as originally proposed by diotalevi works, but suffers from the need to start searching at the beginning of the string each time.

The next step is to try and remember where the last match started, and arrange to start the next iteration at that position +1, but as hv pointed out, the convenience /cg modifier combo that can be applied to m//, doesn't work with s/// as pos get reset after substitutions.

You, diotalevi and Enlil all came up with strategies for saving and restoring pos before / after substitution that work with varing degrees of success.

I tried a crude method of moving the substitution through the string using substr and a for loop, which -- despite forcing the test against every possible n-length substring, rather than allowing the regex engine to find the matches for itself -- stands up surprisingly well to benchmarking.

Its main advantage seems to be the simplicity. All the attempts to allow the regex engine to do the bulk of the searching (which it does so well in most cases) are offset by the need to make two or more function calls, do math and access additional variables.

So, should anyone else but me be interested in these esoteric findings, here is the benchmark and some results. I've haven't found a way around the trap that happens when your Lvalue ref solution is called more than twice, so I've had to specifically eliminate it from most of the tests. I've included the results from a couple of 1-pass runs on very long strings (one packed and one sparse) by way of comparison.

Every time I look at this, I still think that there is a simple solution that I am missing, but it has alluded me so far.

Results

D:\Perl\test>256024 -LEN=1000 -N=1 -CHECK

      s/iter  dio1  ari1  dio2 enlil  dio3  buk3  buk1  bu
dio1    26.1    --  -46%  -46%  -95%  -96%  -98%  -99%  -9
ari1    14.2   84%    --    0%  -90%  -92%  -97%  -98%  -9
dio2    14.2   84%    0%    --  -90%  -92%  -97%  -98%  -9
enlil   1.41 1748%  906%  906%    --  -20%  -72%  -77%  -7
dio3    1.13 2205% 1155% 1155%   25%    --  -65%  -71%  -7
buk3   0.401 6408% 3444% 3444%  252%  182%    --  -17%  -1
buk1   0.331 7785% 4193% 4193%  327%  242%   21%    --   -
buk2   0.330 7808% 4206% 4206%  328%  243%   22%    0%

Ari1:a7aab2bba7aab2bba7aab2bba7aab2bba7aab2bba7aab2bba7aab ...
buk1:a7aab2bba7aab2bba7aab2bba7aab2bba7aab2bba7aab2bba7aab ...
buk2:a7aab2bba7aab2bba7aab2bba7aab2bba7aab2bba7aab2bba7aab ... 
buk3:a7aab2bba7aab2bba7aab2bba7aab2bba7aab2bba7aab2bba7aab ... 
dio1:a7aab2bba7aab2bba7aab2bba7aab2bba7aab2bba7aab2bba7aab ...
dio2:a7aab2bba7aab2bba7aab2bba7aab2bba7aab2bba7aab2bba7aab ...
dio3:a7aab2bba7aab2bba7aab2bba7aab2bba7aab2bba7aab2bba7aab ... 
enll:a7aab2bba7aab2bba7aab2bba7aab2bba7aab2bba7aab2bba7aab ...

D:\Perl\test>256024 -LEN=1000 -N=1 -SPARSE=5 -CHECK
      s/iter   dio1   dio2   ari1  enlil   dio3   buk1   buk2   buk3
dio1     162     --   -44%   -46%   -93%   -95%  -100%  -100%  -100%
dio2    90.7    79%     --    -4%   -88%   -92%   -99%   -99%   -99%
ari1    87.5    85%     4%     --   -88%   -91%   -99%   -99%   -99%
enlil   10.6  1421%   752%   721%     --   -28%   -93%   -93%   -95%
dio3    7.67  2012%  1082%  1040%    39%     --   -90%   -90%   -94%
buk1   0.791 20376% 11364% 10956%  1246%   870%     --    -5%   -39%
buk2   0.752 21438% 11959% 11530%  1316%   920%     5%     --   -36%
buk3   0.480 33642% 18792% 18120%  2118%  1498%    65%    57%     --

Ari1:a7aab2bb....____....____....____....____....____a7aab2bb....____.
+...
buk1:a7aab2bb....____....____....____....____....____a7aab2bb....____.
+...
buk2:a7aab2bb....____....____....____....____....____a7aab2bb....____.
+...
buk3:a7aab2bb....____....____....____....____....____a7aab2bb....____.
+...
dio1:a7aab2bb....____....____....____....____....____a7aab2bb....____.
+...
dio2:a7aab2bb....____....____....____....____....____a7aab2bb....____.
+...
dio3:a7aab2bb....____....____....____....____....____a7aab2bb....____.
+...
enll:a7aab2bb....____....____....____....____....____a7aab2bb....____.
+...

D:\Perl\test>256024 -LEN=1 -N=-3
        Rate  buk3  dio3 enlil  dio2  buk2  buk1  dio1
buk3  2190/s    --   -1%   -3%   -4%  -10%  -13%  -38%
dio3  2210/s    1%    --   -2%   -3%   -9%  -13%  -37%
enlil 2259/s    3%    2%    --   -1%   -7%  -11%  -36%
dio2  2272/s    4%    3%    1%    --   -6%  -10%  -35%
buk2  2430/s   11%   10%    8%    7%    --   -4%  -31%
buk1  2527/s   15%   14%   12%   11%    4%    --  -28%
dio1  3510/s   60%   59%   55%   54%   44%   39%    --

D:\Perl\test>256024 -LEN=10 -N=-3
       Rate  dio2  dio1 enlil  dio3  buk3  buk1  buk2
dio2  177/s    --  -15%  -20%  -20%  -28%  -39%  -40%
dio1  209/s   18%    --   -5%   -5%  -15%  -28%  -29%
enlil 219/s   24%    5%    --   -0%  -10%  -25%  -26%
dio3  220/s   25%    5%    0%    --  -10%  -25%  -26%
buk3  245/s   39%   17%   12%   11%    --  -16%  -17%
buk1  292/s   65%   40%   33%   33%   19%    --   -2%
buk2  296/s   68%   42%   35%   35%   21%    2%    --

D:\Perl\test>256024 -LEN=100 -N=-3
        Rate  dio1  dio2  dio3 enlil  buk3  buk1  buk2
dio1  3.74/s    --  -36%  -84%  -84%  -85%  -87%  -87%
dio2  5.87/s   57%    --  -74%  -75%  -76%  -80%  -80%
dio3  22.8/s  510%  289%    --   -3%   -6%  -22%  -23%
enlil 23.4/s  526%  299%    3%    --   -3%  -20%  -21%
buk3  24.3/s  548%  313%    6%    3%    --  -17%  -18%
buk1  29.2/s  680%  398%   28%   25%   21%    --   -2%
buk2  29.7/s  694%  406%   30%   27%   23%    2%    --
[download]

Benchmark

#! perl -slw
use strict;
use vars qw[$LEN $N $SPARSE $CHECK];
use Benchmark qw[cmpthese];

our $str = '17341234';
$str .= '....____' x $SPARSE if $SPARSE;
$str x= ($LEN || 5);

our (@ari, @buk, @dio, @enlil);

my $test = {
    ari1=> q[
      $ari[1] = $str;
      local *_ = do{ \$ari[1] };
      our $i = 0;
      *_ = \substr( $ari[1], $i ) while s[(.)(...)\1(?{ $i = pos() - 4
+ })][a$2b];
    ],
    buk1=> q[
      $buk[1] = $str;
      substr($buk[1], $_, 5) =~ s[(.)(...)\1][a$2b] for 0 .. length ($
+buk[1]);
    ],
    buk2=> q[
      $buk[2] = $str;
      for( 0 .. length($buk[2]) ) { substr($buk[2], $_, 5) =~ s[(.)(..
+.)\1][a$2b] };
    ],
    buk3=> q[
      $buk[3] =  $str;
      my $i;
      $i = pos($buk[3])-5
      , substr( $buk[3], $i, 5) =~ s[(.)(...)\1][a$2b]
      , pos($buk[3]) = $i
              while $buk[3] =~ m[(.)(...)\1]g;
    ],
    dio1=> q[
      $dio[1] = $str;
      1 while $dio[1] =~ s/(.)(...)\1/a$2b/;
    ],
    dio2=> q[
      $dio[2] = $str;
      $r = 0;
      1 while substr($dio[2],$r) =~ s/(.)(...)\1(?{$r = pos() - 3})/a$
+2b/;
    ],
    dio3=> q[
      $dio[3] = $str;
      $r = 0;
      1 while substr($dio[3],$r) =~ s[(.)(...)\1(?{$r +=pos() - 4})][a
+$2b];
    ],
    enlil=> q[{
      $enlil[1] = $str;
      our $i = 0;
      pos($enlil[1]) = $i - 4
        while $enlil[1] =~ s[\G (.*?) (.) (...) \2 (?{ $i = pos() })  
+]
                            [ defined $1 ? $1.'a'.$3.'b' : 'a'.$3.'b' 
+]ex;
    }],
};

# A bug with Lvalue refs cause Ari1 to segfault if executed more than 
+twice!
delete $test->{ari1}, $ari[1] = 'Test not performed'    if $N < 1 or $
+N > 2;

cmpthese( $N||-3, $test );

exit unless $CHECK;
print "
Ari1:$ari[1]
buk1:$buk[1]
buk2:$buk[2]
buk3:$buk[3]
dio1:$dio[1]
dio2:$dio[2]
dio3:$dio[3] 
enll:$enlil[1]
";

__END__
[download]

Examine what is said, not who speaks.

"Efficiency is intelligent laziness." -David Dunham
"When I'm working on a problem, I never think about beauty. I think only how to solve the problem. But when I have finished, if the solution is not beautiful, I know it is wrong." -Richard Buckminster Fuller

In reply to Re: Re^2: Replace zero-width grouping? by BrowserUk
in thread Replace zero-width grouping? by tinypig

Posts are HTML formatted. Put <p> </p> tags around your paragraphs. Put <code> </code> tags around your code and data!

Titles consisting of a single word are discouraged, and in most cases are disallowed outright.

Read Where should I post X? if you're not absolutely sure you're posting in the right place.

Please read these before you post! —

Posts may use any of the Perl Monks Approved HTML tags:

a, abbr, b, big, blockquote, br, caption, center, col, colgroup, dd, del, details, div, dl, dt, em, font, h1, h2, h3, h4, h5, h6, hr, i, ins, li, ol, p, pre, readmore, small, span, spoiler, strike, strong, sub, summary, sup, table, tbody, td, tfoot, th, thead, tr, tt, u, ul, wbr

You may need to use entities for some characters, as follows. (Exception: Within code tags, you can put the characters literally.)

	For:		Use:
	&		`&`
	<		`<`
	>		`>`
	[		`[`
	]		`]`

Link using PerlMonks shortcuts! What shortcuts can I use for linking?

See Writeup Formatting Tips and other pages linked from there for more info.