in reply to /g option not making s// find all matches

I like Eily's next unless ...; approach++ best, but here's a variation on his or her s///g solution using \G that doesn't depend on hocus-pocusing pos:

    c:\@Work\Perl\monks>perl -wMstrict -le
    "use 5.010;
     ;;
     my @lines = (
       'A line with an_underscore.',
       'A line with_two_underscores.',
       '~A line with an_underscore starting with a tilde.',
       '~A line with_two_underscores starting with a tilde.',
       );
     ;;
     for my $line (@lines, @ARGV) {
       print qq{'$line'};
       $line =~ s{ (?: \G (?! \A) | \A ~) .*? \K _ }{+}xmsg;
       print qq{'$line' \n};
       }
    " "___" "~___"
    'A line with an_underscore.'
    'A line with an_underscore.'

    'A line with_two_underscores.'
    'A line with_two_underscores.'

    '~A line with an_underscore starting with a tilde.'
    '~A line with an+underscore starting with a tilde.'

    '~A line with_two_underscores starting with a tilde.'
    '~A line with+two+underscores starting with a tilde.'

    '___'
    '___'

    '~___'
    '~+++'
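For readers who want the pieces of that substitution spelled out, here is a minimal sketch of the same regex written with /x comments. The sub name plus_underscores is hypothetical, chosen only for illustration; the pattern itself is the one used above.

```perl
use strict;
use warnings;
use 5.010;    # \K needs Perl 5.10 or later

# Hypothetical helper name; it only wraps the substitution from the post.
sub plus_underscores {
    my ($line) = @_;
    $line =~ s{
        (?:
            \G (?! \A)   # resume where the previous match ended, but never at string start
          |
            \A ~         # or begin at the very start, only if the line opens with a tilde
        )
        .*?              # skip as few characters as possible
        \K               # keep everything matched so far out of the replaced span
        _                # so only this underscore is replaced
    }{+}xmsg;
    return $line;
}

print plus_underscores($_), "\n"
    for 'A line with_two_underscores.', '~A line with_two_underscores.';
```

Without a leading tilde neither alternative can ever succeed, so the line is left untouched; with one, each later match picks up at the end of the previous match.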

Update 1: FFR & FWIW, here's a version in the form of a Short, Self-Contained, Correct Example, structured as described in How to ask better questions using Test::More and sample data. Of course, the original question would ideally have been submitted with plenty of test cases (and don't forget degenerate and simple cases!) and with the
    $input =~ s{ ... }{+}xmsg;
statement being raygun's current, unacceptable one, or maybe just a placeholder.

    c:\@Work\Perl\monks>perl -wMstrict -le
    "use 5.010;
     ;;
     use Test::More 'no_plan';
     use Test::NoWarnings;
     ;;
     my @test_set = (
       'degenerate and simple cases',
       [ '',    '',    'empty line'                  ],
       [ ' ',   ' ',   'single space'                ],
       [ '~',   '~',   'single tilde'                ],
       [ '_',   '_',   'single underscore'           ],
       [ '~_',  '~+',  'tilde, underscore'           ],
       [ '_~',  '_~',  'underscore, tilde'           ],
       [ '__',  '__',  'multiple underscores'        ],
       [ '~__', '~++', 'tilde, multiple underscores' ],
       [ '__~', '__~', 'multiple underscores, tilde' ],
       'more complicated cases',
       [ 'A line with an_underscore.',
         'A line with an_underscore.',
         'no leading tilde, text w/one underscore' ],
       [ 'A line with_two_underscores.',
         'A line with_two_underscores.',
         'no leading tilde, text w/two underscores' ],
       [ '~A line with an_underscore starting with a tilde.',
         '~A line with an+underscore starting with a tilde.',
         'leading tilde, text w/one underscore' ],
       [ '~A line with_two_underscores starting with a tilde.',
         '~A line with+two+underscores starting with a tilde.',
         'leading tilde, text w/two underscores' ],
       );
     ;;
     VECTOR:
     for my $ar_vector (@test_set) {
       if (not ref $ar_vector) {
         note $ar_vector;
         next VECTOR;
         }
       ;;
       my ($input, $expected, $comment) = @$ar_vector;
       ;;
       $input =~ s{ (?: \G (?! \A) | \A ~) .*? \K _ }{+}xmsg;
       is $input, $expected, $comment;
       }
     ;;
     done_testing;
     ;;
     exit;
    "
    # degenerate and simple cases
    ok 1 - empty line
    ok 2 - single space
    ok 3 - single tilde
    ok 4 - single underscore
    ok 5 - tilde, underscore
    ok 6 - underscore, tilde
    ok 7 - multiple underscores
    ok 8 - tilde, multiple underscores
    ok 9 - multiple underscores, tilde
    # more complicated cases
    ok 10 - no leading tilde, text w/one underscore
    ok 11 - no leading tilde, text w/two underscores
    ok 12 - leading tilde, text w/one underscore
    ok 13 - leading tilde, text w/two underscores
    1..13
    ok 14 - no warnings
    1..14

Update 2: Removed an extraneous  VECTOR: label that had crept into SSCCE code example. Code function unchanged.


Give a man a fish:  <%-{-{-{-<

Re^2: /g option not making s// find all matches (updated)
by raygun (Scribe) on May 29, 2018 at 03:05 UTC
    Thank you for the detailed answer. For readability, I too like the next unless ...; solution; however, in context, I kind of needed a single regular expression to avoid retooling the surrounding code.

    I never would have thought of the \G (?! \A) construct — I'm still not quite certain I understand how it works, but it does the trick!

    The /ms options seem unnecessary here; did you include them just to make the solution as generic as possible?

      ... the \G (?! \A) construct ... how it works ...

      This simply asserts that \G (at previous match point or at absolute start of string if first match) is true, and that \A (at absolute start of string) is not true; i.e., that \G is not matching at the start of the string. This is a little confusing in that (?! ...) is a negative look-ahead and you may wonder how one can look ahead to the absolute start of a string. However, \G and \A are both zero-width assertions, so it doesn't matter which way you look as long as you're negating. The following all work identically:
          \G (?! \A)    \G (?<! \A)    (?! \A) \G    (?<! \A) \G
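A quick way to check the claim is to drop each spelling into the same substitution and compare results. This is only a sketch: the helper name apply_guard and the sample strings are made up for illustration, and the guard is interpolated into the pattern as a string.

```perl
use strict;
use warnings;
use 5.010;    # \K needs Perl 5.10 or later

# The four spellings listed above; each asserts that \G is not at string start.
my @guards = (
    '\G (?! \A)',
    '\G (?<! \A)',
    '(?! \A) \G',
    '(?<! \A) \G',
);

# Hypothetical helper: build the substitution around one guard and apply it.
sub apply_guard {
    my ($guard, $line) = @_;
    my $rx = qr{ (?: $guard | \A ~ ) .*? \K _ }xms;
    $line =~ s{$rx}{+}g;
    return $line;
}

for my $guard (@guards) {
    printf "%-14s  %s  %s\n",
        $guard,
        apply_guard($guard, '~a_b_c'),    # leading tilde: underscores replaced
        apply_guard($guard, 'a_b_c');     # no tilde: left alone
}
```

Each row should show the same pair of results, since every guard is a zero-width test made at the same position.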

      The /ms options seem unnecessary here; did you include them just to make the solution as generic as possible?

      In line with the regex recommendations in TheDamian's Perl Best Practices, I always use an /xms tail on every qr// m// s/// expression I write. Of course, /x allows whitespace and comments; can't be bad. In addition, with /ms always present, ^ $ . always behave in the same way and I don't have to think about it any more; regexes are complicated enough as it is. E.g., the default behavior of . (dot) is "dot matches everything except a newline unless the /s modifier is asserted, in which case dot matches everything; now let's see whether or not there's a /s around anywhere". TheDamian recommends, and I prefer, to always use the /s modifier and just think about the behavior of dot as "dot matches all". Period. Similarly for the ^ $ assertions and the /m modifier: with /m always on, ^ and $ always work at line boundaries, and \A and \z are used for the true start and end of the string. (In general, I have quite a bit of respect for TheDamian's PBPs. I don't agree with them all, but I have embraced the regex PBPs completely and wholeheartedly.)

      So in answer to your direct question, the universal /xms tail is not used to make regexes generic so much as to make them less-thought-needed-ic.
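To make the difference concrete, here is a small sketch of what /s and /m change, using a made-up two-line string. The variable names are illustrative only.

```perl
use strict;
use warnings;

my $text = "one\ntwo\n";

# Without /s, dot stops at a newline; with /s, dot matches any character.
my ($no_s)   = $text =~ m/(o.*)/;
my ($with_s) = $text =~ m/(o.*)/s;

# Without /m, ^ matches only at string start; with /m, after each newline too.
my @no_m   = $text =~ m/^(\w+)/g;
my @with_m = $text =~ m/^(\w+)/mg;

printf "dot without /s captured %d chars, with /s %d chars\n",
    length $no_s, length $with_s;
printf "^ without /m found %d word(s), with /m %d word(s)\n",
    scalar @no_m, scalar @with_m;
```

With an /xms tail everywhere, only the second behavior of each pair ever applies, which is the "don't have to think about it" point made above.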



            The following all work identically:
                \G (?! \A)    \G (?<! \A)    (?! \A) \G    (?<! \A) \G

        That's brilliant and kind of hurts my brain... :-)

            In line with TheDamian's regex Perl Best Practices, I always use an /xms tail on every qr// m// s/// expression I write.

        Thanks for the pointer to that. I'll look more into it. In general, I like to use default behaviors unless I need to do something that the default can't accomplish. Then the mechanism to override the default (appending /xms in this case) becomes part of the code's self-documentation, alerting the reader that something outside the norm is happening. (That philosophy fails if some program or interface's defaults are insane, but in my experience, perl's are pretty solid.)

        Also, from a readability standpoint, if you have ten regexes, nine of which end in / and the tenth ending in /m, it's easy to see at a glance that the tenth is doing something outside the default. But if you define your default to be /xms, in code with nine regexes ending in /xms and a tenth ending in /xs, the reader is much more likely to overlook the fact that the tenth instance is overriding the local default.

        But, again, I say all this without having digested the rationale for TheDamian's recommendations, so it's all FWIW.