
Greetings Monks, especially hippo :-)

When I posted this question I had no idea it would attract so many Monks, or that so much of the discussion would be about grouping. The inference I draw is that I offended Perl sensibilities by using () when it wasn't necessary. My bad, but that deserves a bit of discussion, especially since hippo included this line in his rejoinder, to wit:

    use 5.020; # for best efficiency on $&

I don't change my perl very often and I depend most heavily for documentation on 'Programming Perl', 3rd and 4th. In the early days I used whatever perl was installed on my employer's or client's machines. When I first put perl up on a personal machine it was 5.6.1, consistent with PP 3rd, and then 5.18 (close to PP 4th's 5.16). Later I jumped to 5.34 and now 5.36, but I still rely on PP, which advises against using any of $`, $& and $'. I used them only when debugging a difficult regex I was writing. Now I just use Regexp::Debugger. So hippo's 'use 5.020' line sent me off on a search for an explanation of why so many Monks are using $&. Here is what I found in perlvar:

   In Perl 5.20.0 a new copy-on-write system was enabled by default, which
   finally fixes all performance issues with these three variables, and
   makes them safe to use anywhere.

So, I now understand, if the notion you are all trying to convey to me is to use only as much syntax as is necessary to get the job done. I infer from that that you would use grouping only if you were trying to pick two or more sub-expressions out of a string. If only one sub-expression is desired, then () are not needed and $& is spot on. But TMTOWTDI, so I supplemented hippo's revision with my own code, shown below, which is also open to comment and discussion from y'all, if you care to comment.

I have long been fascinated by the commify problem I first encountered in Friedl, and I've spent many enjoyable hours trying to hack a regex that could commify reals that have fractional parts longer than 3 digits, to avoid results like 123,456.334,56. After grasping from the comments of several Monks that CODE in s/PATTERN/CODE/e is really a code block, or just another piece of perl code that is evaluated under /e, and that the last expression evaluated is returned (except, of course, when using /r), I could see the awesome power of the notion of a regex within a regex, which was exactly the purpose of my original post: to get that usage right. That also explains the summed boolean results in my original post, since the inner s/// itself is the last expression evaluated in that block: six substitutions with /g, thus '6.56'. Now, please note my new commify solution in the following code, which solves the problem by not commifying the fraction part:

    use strict;
    use 5.020; # for best efficiency on $&
    use Test::More tests => 11;

    my $orig = 'This is a real number, 123456.56';
    my $want = 'This is a real number, 493824.56';
    (my $have = $orig) =~ s/\d+/sprintf "%i", 4 * $&/e;
    my $intpart = $&;
    is $intpart, 123456, 'int part captured';
    is $have, $want, 'Multiply int part by 4';

    ($have = $intpart) =~ s/./3/g;
    is $have, 333333, 'Int part digits set to "3" trivially, with /./';
    ($have = $intpart) =~ s/\d/3/g;
    is $have, 333333, 'Int part digits set to "3" trivially, with /\\d/';

    $want = 'This is a real number, 333333.56';
    ($have = $orig) =~ s/\d+/(my $int = $&) =~ s#.#3#g; $int/e;
    is $have, $want, 'Replace int part in string with all 3s';

    $want = 'This is a real number, 123,456.56';
    ($have = $orig) =~ s{ \d+ }{
        $& =~ s/ (?<=\d) (?= (?:\d{3} )+ (?!\d) ) /,/xrg;
    }ex;
    $intpart = $&;
    is $intpart, 123456, '$& int part captured';
    is $have, $want, 'Insert commas where appropriate';

    ($have = $orig) =~ s{ (\d+) }{
        $1 =~ s/ (?<=\d) (?= (?:\d{3} )+ (?!\d) ) /,/xrg;
    }ex;
    $intpart = $1;
    is $intpart, 123456, '$1 int part captured';
    is $have, $want, 'Insert commas where appropriate';

    ($have = $orig) =~ s{ (?<int> (\d+)) }{
        $+{int} =~ s/ (?<=\d) (?= (?:\d{3} )+ (?!\d) ) /,/xrg;
    }ex;
    $intpart = $+{int};
    is $intpart, 123456, '<int> part captured';
    is $have, $want, 'Insert commas where appropriate';

    exit(0);
    __END__

It also works with 123456.56567. Try it. So, there is one solution with $&, plus two that use grouping with $1 and $+{name}, but again, I concede that $& is right-on unless more than one sub-expression is targeted. Is that a fair rendering of the grouping issues in this dialogue? If not, I'm still a student of perl, since next to calculus, perl is one of the world's greatest inventions :-) Also, I am enough of a nerd to want to learn more about the changes wrought in 5.20, that copy-on-write system, that make $& safe to use. Could you please suggest a homework reading assignment? Thanks again for a very stimulating discussion.
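
For anyone following along, the two effects described above can be reproduced in isolation. This is only a minimal sketch: the variable names ($count, $naive, $nested) and the sample string with a five-digit fraction are made up for illustration and are not taken from the thread's test script. It shows the inner s///g returning its substitution count when /r is absent, a single-pass commify spilling into the fraction, and the nested form confining the commas to the integer part.

```perl
use strict;
use warnings;

# Hypothetical sample: same shape as the thread's string, longer fraction.
my $orig = 'This is a real number, 123456.33456';

# Inner s///g without /r returns the number of substitutions (6 digits here),
# so that count is what /e splices into the outer string.
(my $count = $orig) =~ s/\d+/(my $int = $&) =~ s#.#3#g/e;

# A single-pass commify over the whole string also hits the fraction.
(my $naive = $orig) =~ s/ (?<=\d) (?= (?:\d{3})+ (?!\d) ) /,/xg;

# Nesting confines the commify to the first run of digits (the integer part);
# /r makes the inner s/// return the edited copy instead of a count.
(my $nested = $orig) =~ s{ \d+ }{ $& =~ s/ (?<=\d) (?= (?:\d{3})+ (?!\d) ) /,/xgr }xe;

print "$_\n" for $count, $naive, $nested;
```

The outer s{ \d+ }{...}xe has no /g, so it stops after the first run of digits; that is what keeps the fraction untouched in the nested form.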

Re^5: regex in REPLACEMENT in s///
by tybalt89 (Monsignor) on Sep 14, 2023 at 21:27 UTC

    Maybe slightly cleaner ?

    (123456.56567 * 4) =~ s{ \d+ }{ $& =~ s/ \B (?= (?:\d{3})+ $ ) /,/xgr }xer
    outputs
    493,826.26268

      Greetings tybalt89

      I wanted to see if I could make yours work with mine, and with a minor tweak, I did. I prefer to use assignment with s///, as in:

         (my $tybalt89 = $str) =~ s/...
      

      You did not; to each his own. To make yours work in my preferred model I removed the /r from the outer s///. Once I did that I was able to replace my CODE statement with yours, as a drop-in. I teach math to undergrads with Perl, Python and R, and I have to be able to answer their questions, so I had to understand exactly what is going on here. I also added an extra digit, to see multiple ',' insertions, and put the real number into a string. TMTOWTDI, works great, as in:

         use 5.20.0;
         use strict;

         my $n = 4;
         my $raw = (1234567.56567);
         my $orig = (1234567.56567 * $n);
         say "raw real number:   $raw";
         say "factored real nbr: $orig";
         my $str = "This is a real number ${\($raw * $n)}.";
         say $str;

         (my $tybalt89 = $str) =~ s{ \d+ }{ $& =~ s/ \B (?= (?:\d{3})+ $ ) /,/xrg }xe;
         say "tybalt89's with assignment and \\r tweak => \'$tybalt89\'";

         (my $perlboy = (($raw) =~ s{ \d+\.?\d* }{ $& * 4 }erx ))
             =~ s{ \d+ }{ $& =~ s/ (?<=\d) (?= (?:\d{3} )+ (?!\d) ) /,/xrg; }ex;
         say "perlboy's => \'$perlboy\'";

         (my $mixed = (($raw) =~ s{ \d+\.?\d* }{ $& * 4 }erx ))
             =~ s{ \d+ }{ $& =~ s/ \B (?= (?:\d{3})+ $ ) /,/xrg }xe;
         say "perboy's w/tybalt89's drop-in with \\r tweak => \'$mixed\'";

         say "tybalt89's original regex, without assignment => ",
             ($str) =~ s{ \d+ }{ $& =~ s/ \B (?= (?:\d{3})+ $ ) /,/xgr }xer;
         say "tybalt89's original regex, without assignment => ",
             (1234567.56567 * 4) =~ s{ \d+ }{ $& =~ s/ \B (?= (?:\d{3})+ $ ) /,/xgr }xer;
         exit(0)
         __END__

      which when run yields:

         raw real number:   1234567.56567
         factored real nbr: 4938270.26268
         This is a real number 4938270.26268.
         tybalt89's with assignment and \r tweak => 'This is a real number 4,938,270.26268.'
         perlboy's => '4,938,270.26268'
         perboy's w/tybalt89's drop-in with \r tweak => '4,938,270.26268'
         tybalt89's original regex, without assignment => This is a real number 4,938,270.26268.
         tybalt89's original regex, without assignment => 4,938,270.26268
      

      As I said, the only change to yours was to remove /r from the outer s///. To verify it would insert multiple ',', I added an extra digit and interpolated the real number into a string. So, that \B assertion works, and by removing two look-arounds, probably runs faster than mine.
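
To check the claim that the \B form and the look-around form agree, here is a small side-by-side sketch. The sub names commify_lookaround and commify_nonboundary are hypothetical, and the comparison assumes the input is a string of digits only, which is what $& holds inside the outer s{ \d+ }{...}e. It says nothing about speed.

```perl
use strict;
use warnings;

# Both subs take a digits-only string and return a commified copy (/r).
# Look-around form: a digit behind, groups of three digits ahead, then a non-digit.
sub commify_lookaround  { $_[0] =~ s/ (?<=\d) (?= (?:\d{3})+ (?!\d) ) /,/xgr }

# \B form: not at a word boundary, groups of three digits ahead up to end of string.
sub commify_nonboundary { $_[0] =~ s/ \B (?= (?:\d{3})+ $ ) /,/xgr }

for my $digits (qw(1 12 123 1234 123456 1234567)) {
    printf "%-8s %-10s %s\n",
        $digits, commify_lookaround($digits), commify_nonboundary($digits);
}
```

On a digits-only string \B rules out the very start of the string (the only place a leading comma could otherwise land) and $ plays the role of (?!\d), so the two patterns pick the same positions. That equivalence does not carry over to strings containing other characters.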

      
      UPDATE 9/17/2023
      
         Running timethese(-10, {...});
         Benchmark: running perlboy, tybalt89 for at least 10 CPU seconds...
          perlboy: 10 wallclock secs (10.52 usr +  0.00 sys = 10.52 CPU) @ 25551.33/s (n=268800)
         tybalt89: 11 wallclock secs (10.45 usr +  0.00 sys = 10.45 CPU) @ 27807.18/s (n=290585)
      
         Running cmpthese(-10, {...});
                     Rate  perlboy tybalt89
         perlboy  25623/s       --      -8%
         tybalt89 27900/s       9%       --
      
      

      Cheers, and happy perl-ing

      Nice try, but that was not my use case. I only multiplied the int part to prove I could do it from the CODE part of s///. Now let's say that IS my use case, that I want to multiply AND commify WITHOUT grouping, and the real number MUST be in a string, as in hippo's test paradigm, to wit:

         $orig = 'This is a real number, 123456.56567';
         $want = 'This is a real number, 493,826.26268';
      

      I can do it with mine but not yours, as in:

         $want = 'This is a real number, 493,826.26268';
         ($have = (($orig) =~ s{ \d+\.?\d* }{ $& * 4 }erx ))
             =~ s{ \d+ }{ $& =~ s/ (?<=\d) (?= (?:\d{3} )+ (?!\d) ) /,/xrg; }ex;
         $intpart = $&;
         is $intpart, 493826, '$& int part captured';
         is $have, $want, 'Insert commas where appropriate';

      And as the Monks prefer, no grouping () to be seen. That is three s/// in a one-liner. First, grab the real number and factor it; next, grab just the int part and commify it; finally, put it back in the string. Man, s/PATTERN/CODE/e is potent mojo.

        And as the Monks prefer, no grouping () to be seen.

        Just to nail this point, I doubt anyone has good grounds to object to relevant use of grouping. The problem is that by default using brackets creates a capture group and this results in 2 levels of potential inefficiency (the group and the capture). And that's not to say that capture groups are bad - far from it. There is just no point in paying the performance penalty if you don't need the capturing.

        The inefficiencies being discussed here are obviously small, but if you are using such a regex in the middle of a tight loop which runs millions or billions of times then these things mount up, so it is as well to be aware of them. I wouldn't go tying myself in knots to avoid them in cases where they don't matter.

        In general:

        • If you have a need for capture groups, use them but be aware of the potential performance penalty
        • If you have a need for groups but not capture, use them but ensure they are non-capturing using the methods already outlined by haukex
        • If you don't have a need for groups, don't use them
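
A minimal sketch of the first two cases in the list above. The sample string and variable names are made up for illustration, and it assumes (?:...) is among the non-capturing methods referred to; it shows only what gets saved, not the cost.

```perl
use strict;
use warnings;

my $s = 'abcabc123';

# Capturing group: the quantified group is also saved (last repetition) in $1.
$s =~ / (abc)+ \d+ /x or die "no match";
my $with_capture = $1;                 # 'abc'

# Non-capturing group: (?:...) still groups for the +, but saves nothing.
$s =~ / (?:abc)+ \d+ /x or die "no match";
my $without_capture = $1;              # undef: this match has no group 1
my $whole_match     = $&;              # the entire match is still available

printf "with capture:    %s\n", $with_capture;
printf "without capture: %s\n", defined $without_capture ? $without_capture : 'undef';
printf "whole match:     %s\n", $whole_match;
```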

        We've all seen code where people have sprinkled unnecessary characters. In most cases it has no effect (good or bad) and just makes the code harder to read and maintain. With brackets in regex there is an effect and the price is worth paying if your code makes use of the functionality. But if you don't need it then it is best avoided.


        🦛