Beefy Boxes and Bandwidth Generously Provided by pair Networks
Syntactic Confectionery Delight
 
PerlMonks  

Re: Replace zero-width grouping?

by BrowserUk (Patriarch)
on May 06, 2003 at 21:49 UTC ( [id://256058]=note: print w/replies, xml ) Need Help??


in reply to Replace zero-width grouping?

Your two pass solution looks fine provided the string isn't any longer than the sample you supplied, but when the string gets longer, it falls down.

my $a = "17341234173412341734123417341234"; $a=~s/(.)(?=...\1)/A/g; $a=~s/(?<=A...)(.)/B/g; print $a; A7AAB2BBB7BBB2BBB7BBB2BBB7BBB2BB

diotalevi's solution works much better

my $a = "17341234173412341734123417341234"; my $n=0; $n++ while $a =~ s/(.)(...)\1/A$2B/; print $n, $a; 12 A7AAB2BBA7AAB2BBA7AAB2BBA7AAB2BB

The downside is that as the length of the string grows, so do the number of passes, and it is having to re-scan the parts of the string it has already processed, each time through. For short strings this isn't a great problem, but if the strings are longer, then there is an alternative method that avoids it.

my $a = "17341234173412341734123417341234"; substr($a, $_, 5) =~ s[(.)(...)\1][A$2B] for 0 .. length ($a); print $a; A7AAB2BBA7AAB2BBA7AAB2BBA7AAB2BB

By using substr as an lvalue for the substitution, you can perform process in a single (overlapping) pass that reduces the work done by the regex engine by only looking at each group of 5 characters at a time.

The difference in performance only really becomes evident once the string length gets above about 5 times the length of your original sample, but the technique is useful in its own right for some things.


Examine what is said, not who speaks.
"Efficiency is intelligent laziness." -David Dunham
"When I'm working on a problem, I never think about beauty. I think only how to solve the problem. But when I have finished, if the solution is not beautiful, I know it is wrong." -Richard Buckminster Fuller

Replies are listed 'Best First'.
Re: Re: Replace zero-width grouping?
by hv (Prior) on May 07, 2003 at 02:12 UTC

    I hoped to fix the two-pass version by adding a check to make sure this character wasn't already earmarked to become a "B" before accepting it to become an "A":

    s/(?<!A...)(.)(?=...\1)/A/g; s/(?<=A...)(.)/B/g;
    but that still fails in exactly the same way, because the regexp engine goes to some effort to make sure that the pattern is matched against the original unmodified string each time through an s///g.

    It is also unfortunate that s/// doesn't have the same support for things the //gc flags - it would have been handy to be able to solve this with something like:

    pos($_) -= 3 while s/(.)(...)\1/A$2B/gc;

    Hmm, frustrating - I feel sure there must be a simple and efficient solution.

    Hugo
      I don't know exactly if this is the type of solution you and BrowserUK were looking for (working towards), but it seems to work.. and should cut back on the rescanning thing. I left the (?{print pos($_)."\n"}) (it can be taken out if it is to much of an eyesore), to show where it starts matching, and uncommenting the use re 'debug'; line seems to confirm this. Anyhow here is the code:
      use strict; use warnings; #use re 'debug'; $_ = "17341234173412341734123417341234"; my $i; pos() = $i while s<\G (?{print pos($_) . "\n"}) (.*?)(.)(...)\2 (?{$i=pos()-4}) > <defined $1?$1."A".$3."B":"A".$3."B">ex; print; __END__ 0 1 3 4 9 11 12 17 19 20 25 27 28 A7AAB2BBA7AAB2BBA7AAB2BBA7AAB2BB
      update: FWIW I did do much benchmarking using cmpthese in the benchmark module, and it appears that after all the hand waving, and tweaks, nothing seems to run faster than BrowserUK's substr solution, and probably the one I would go with for long strings of digits. But I would make one last change tweak first. That being changing the first . to a \d .
      That is:
      substr($a, $_, 5) =~ s[(\d)(...)\1][A$2B] for 0 .. length ($a);
      Which should speed up the work the regex engine has to do as it will skip all the positions that do not start with a digit (and hence have already been turned into a letter (: ) For that matter the original could have been made better doing the same:
      s!(\d)(...)\1!A$2B!;

      -enlil

      Yes. I kept feeling there was a solution there somewhere, but be darned if I could find it.

      I tried playing games with \G as part of a positive look-behind assertion, but as is documented in the 5.8 copy of perlre, this isn't really supported. Constraining the match with a lvalue substr was the best I could come up with.


      Examine what is said, not who speaks.
      "Efficiency is intelligent laziness." -David Dunham
      "When I'm working on a problem, I never think about beauty. I think only how to solve the problem. But when I have finished, if the solution is not beautiful, I know it is wrong." -Richard Buckminster Fuller
Re^2: Replace zero-width grouping? (s/// against substr)
by Aristotle (Chancellor) on May 07, 2003 at 12:38 UTC
    This is an iterative approach as well but more efficient:
    #!/usr/bin/perl -wl use strict; my $str = "17341234173412341734123417341234"; { local *_ = \$str; *_ = \substr $_, pos()-4 while s/(.)(...)\1/A$2B/; } print $str;

    I am exploiting the fact that assigning a ref to something to a glob creates an alias to that something; this also works for a ref to an lvalue such as substr returns. Only when a match happens, the substitution is restarted, and then in order to avoid rescanning, I hide the already processed front of the string by aliasing $_ to the unprocessed portion. In the ideal case of no match, this means the RE engine will only be invoked once, regardless of the string's length.

    Unfortunately, it doesn't work in this simple form because alas, s/// does not leave pos in a useful state after a substitution. The real code thus needs an ugly temporary variable and a messy addition to the substitution:

    my $i = 0; *_ = \substr $str, $i while s/(.)(...)\1(?{ $i += pos() - 4 })/A$2 +B/;

    The result is A7AAB2BBA7AAB2BBA7AAB2BBA7AAB2BB.

    Update: I just noticed this is essentially the same as diotalevi is doing. Doh.

    Makeshifts last the longest.

      This has taken a while because I encountered problems with benchmarking your solution. I probably would have given up at that point except that the problem itself is an interesting, in that it's indicative of a whole class of problems that require making global substitutions on a string, where the subsequent matches can overlap with previous replacements. The regex engine does not appear to have a good solution for this problem currently.

      The OP's original two pass solution doesn't scale once the matched string moves beyond the end of the original replacement.

      Taking the loop out of the regex as originally proposed by diotalevi works, but suffers from the need to start searching at the beginning of the string each time.

      The next step is to try and remember where the last match started, and arrange to start the next iteration at that position +1, but as hv pointed out, the convenience /cg modifier combo that can be applied to m//, doesn't work with s/// as pos get reset after substitutions.

      You, diotalevi and Enlil all came up with strategies for saving and restoring pos before / after substitution that work with varing degrees of success.

      I tried a crude method of moving the substitution through the string using substr and a for loop, which -- despite forcing the test against every possible n-length substring, rather than allowing the regex engine to find the matches for itself -- stands up surprisingly well to benchmarking.

      Its main advantage seems to be the simplicity. All the attempts to allow the regex engine to do the bulk of the searching (which it does so well in most cases) are offset by the need to make two or more function calls, do math and access additional variables.

      So, should anyone else but me be interested in these esoteric findings, here is the benchmark and some results. I've haven't found a way around the trap that happens when your Lvalue ref solution is called more than twice, so I've had to specifically eliminate it from most of the tests. I've included the results from a couple of 1-pass runs on very long strings (one packed and one sparse) by way of comparison.

      Every time I look at this, I still think that there is a simple solution that I am missing, but it has alluded me so far.

      Results

      Benchmark


      Examine what is said, not who speaks.
      "Efficiency is intelligent laziness." -David Dunham
      "When I'm working on a problem, I never think about beauty. I think only how to solve the problem. But when I have finished, if the solution is not beautiful, I know it is wrong." -Richard Buckminster Fuller

        It's the localization of the glob that somehow messes things up (actually, it appears to happen when Perl tries to restore it upon exiting the block). It doesn't really matter though. The principle is exactly the same as your "dio3" code, just more convoluted.

        My intuition says the crucial bottleneck is the code assertion in the pattern, rather than the math. "buk3" beats "dio3" mostly, and the former certainly does more math than the latter.

        One thing that occured to me is that when you have 4 substitutions in a row, you can skip the next 4 characters, since they'll all have been replaced. I can't think of a simple way to account for this fact in code though. Without a working single pattern solution it would require a nested loop, which is most probably not worth the effort. And if we had a working single pattern solution, there'd not probably be any manual optimization that could beat just letting the regex engine do its job.

        Update: thinking about this gave me an idea.

        #!/usr/bin/perl -wl use strict; my @ari; my $str = "17341234173412341734123417341234"; $ari[2] = $str; substr($ari[2], $_*4, 8) =~ s[(.)(?=...\1)|(?<=(.)...)\2]{ defined $1 ? "A" : "B" }eg for 0 .. length($ari[2])/4; print $ari[2]; __END__ A7AAB2BBA7AAB2BBA7AAB2BBA7AAB2BB

        (Strange variable names left intact for easy addition to your benchmark.) It is consistently 4% faster than "buk3" on my machine.

        All that said and done, unless I had a real performance problem with this task, I'd use diotalevi's initial simpleminded approach in production. It works correctly and is far easier to read than any of the alternatives.

        Makeshifts last the longest.

        For the record, the lvalue-substr-localised-in-glob problem seems to be fixed in the latest development version of perl. Here are the results I get:

        zen% t0 -N=-1 -LEN=1 Rate ari1 enlil buk2 buk3 buk1 dio2 dio3 dio1 ari1 23424/s -- -8% -14% -15% -16% -20% -21% -36% enlil 25599/s 9% -- -6% -8% -8% -13% -13% -31% buk2 27306/s 17% 7% -- -1% -2% -7% -8% -26% buk3 27676/s 18% 8% 1% -- -1% -5% -6% -25% buk1 27926/s 19% 9% 2% 1% -- -5% -5% -24% dio2 29256/s 25% 14% 7% 6% 5% -- -1% -21% dio3 29537/s 26% 15% 8% 7% 6% 1% -- -20% dio1 36885/s 57% 44% 35% 33% 32% 26% 25% -- zen% t0 -N=-1 -LEN=10 Rate dio1 ari1 dio2 enlil buk2 buk3 buk1 dio3 dio1 1450/s -- -8% -21% -49% -50% -50% -51% -54% ari1 1569/s 8% -- -15% -44% -45% -46% -47% -50% dio2 1846/s 27% 18% -- -34% -36% -36% -38% -42% enlil 2817/s 94% 80% 53% -- -2% -3% -5% -11% buk2 2875/s 98% 83% 56% 2% -- -1% -3% -9% buk3 2901/s 100% 85% 57% 3% 1% -- -2% -8% buk1 2955/s 104% 88% 60% 5% 3% 2% -- -7% dio3 3169/s 118% 102% 72% 12% 10% 9% 7% -- zen% t0 -N=-1 -LEN=100 Rate dio1 ari1 dio2 enlil buk3 buk2 buk1 dio3 dio1 20.0/s -- -44% -45% -93% -93% -93% -93% -93% ari1 35.5/s 78% -- -3% -87% -88% -88% -88% -88% dio2 36.6/s 83% 3% -- -87% -87% -88% -88% -88% enlil 272/s 1262% 667% 644% -- -6% -7% -9% -11% buk3 290/s 1352% 718% 693% 7% -- -1% -3% -5% buk2 294/s 1368% 727% 702% 8% 1% -- -2% -4% buk1 299/s 1395% 742% 717% 10% 3% 2% -- -3% dio3 307/s 1437% 765% 740% 13% 6% 5% 3% -- zen% t0 -N=-1 -LEN=100 -SPARSE=5 Rate dio1 dio2 ari1 buk2 buk1 enlil dio3 buk3 dio1 3.48/s -- -48% -49% -96% -96% -98% -98% -98% dio2 6.73/s 94% -- -1% -92% -93% -96% -96% -97% ari1 6.80/s 95% 1% -- -92% -93% -96% -96% -97% buk2 89.1/s 2461% 1224% 1211% -- -4% -44% -50% -55% buk1 92.5/s 2558% 1274% 1260% 4% -- -42% -48% -54% enlil 160/s 4500% 2277% 2254% 80% 73% -- -10% -20% dio3 177/s 4993% 2532% 2507% 99% 92% 11% -- -11% buk3 199/s 5623% 2857% 2829% 123% 115% 24% 12% -- zen%

        Hugo

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: note [id://256058]
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others making s'mores by the fire in the courtyard of the Monastery: (7)
As of 2024-04-25 14:30 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found