in reply to Greediness of * vs. greediness of +

This is a "further" question re the replies from moritz and Marshall.

And in following their answers, I found it helpful to remember that "greedy" is not the same as "global." While (b*) is greedy, it is not global, as /(b*)/g is. In other words (I thought) /(b*)/ stops after the first success (I originally wrote "failure"; moritz caught this), at the start of the string, whereas adding the /g would tell the regex engine to keep on trying until it reaches the end of the string.

Well, that was my second thought.

But, OOOPS,

"abbbbc"=~/(b*)/g && print "Found: $1"; # Found:

Huh?

Well, is this a case where the rules are different in substitution?

my $string1 = "abbbbc";
my $found1 = $string1 =~ s/(b*)/^/g;
print "\$string1: $string1, \$found1: $found1";

$found1 is 4

=head after s/// $string: ^a^^c^
At begin of string ...no "b" found  |   # satisfies "0 or more 'b's"
"a" (duh!)                          |
Two "b"(s) found                    |   # likewise; the first and second 'b's are conflated?
and again                           |
"c"                                 |
no "b" after "c"                    |
=cut

That this code produces two replacements for the string of four 'b's remains a puzzle. Why does it appear (this may be my error) that the regex conflates two 'b's rather than all four?

Enlightenment?
or Coffee?

Update: Wonderful answer below. s/failure/success/ per moritz; italics closed per ssandv.

Re^2: Greediness of * vs. greediness of +
by moritz (Cardinal) on Sep 08, 2010 at 13:07 UTC
    In other words (I thought) /(b*)/ stops after the first failure, at the start of the string,

    s/failure/success/

    whereas adding the /g would tell the regex engine to keep on trying until it reaches the end of the string.

    What /g does on a regex depends on context. In boolean scalar context it matches once, and stores the position in pos. If you execute it a second time, it starts off from where it left off.

    The background is that you can write

    while (/(b*)/g) { ... }

    and get a new match for each iteration:

    $ perl -e '$_="abbbabc"; while (/(b*)/g) { print "($1)\n" }'
    ()
    (bbb)
    ()
    (b)
    ()
    ()
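To make the context dependence concrete, here is a minimal sketch (the variable names are my own, not from the thread) that runs the same pattern once in scalar context and once in list context. The explicit reset of pos between the two is an assumption about how you would want to compare them fairly; without it the list-context match would resume from where the scalar match stopped.

```perl
use strict;
use warnings;

my $text = "abbbabc";

# Scalar (boolean) context: /g performs one match per evaluation.
my $first;
if ($text =~ /(b*)/g) {
    $first = $1;    # the empty string, matched at offset 0 before the 'a'
}

# Clear the remembered /g position so the next match starts from scratch.
pos($text) = undef;

# List context: /g returns the capture from every match in one go.
my @all = $text =~ /(b*)/g;

print "first: ($first)\n";
print "all:   ", join(" ", map { "($_)" } @all), "\n";
```

This is why the one-shot `"abbbbc" =~ /(b*)/g && print "Found: $1"` prints nothing after "Found:": in boolean context only the first, empty, match is taken.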

    Update: Answer to the second question

    That this code produces two replacements for the string of four 'b's remains a puzzle. Why does this appear (this may be my error) that the regex conflates two 'b's rather than all four?

    A naive substitution implementation would loop on s/b*/^/g, because it would continue to replace the empty string with ^ forever.

    Perl is a bit more sophisticated: it detects a zero-width match, and before doing a second substitution of a zero-width match at the same position it bumps along and tries at the next position.

    So applying s/b*/^/g to abba takes these steps:

    abba    | match zero b's before a
    ^abba   | match zero b's again. Don't substitute here, bump along
    ^abba   | match 'bb'
    ^a^a    | match zero b's
    ^a^^a   | match zero b's, don't substitute but bump along
    ^a^^a^  | match zero b's, don't substitute but bump along

    You can watch it work; I didn't find a way to get the modified string, but at least you can monitor the match positions:

    $ perl -le '$_ = "abba"; s/b*/print pos; "^"/eg; print'
    0
    1
    3
    4
    ^a^^a^
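As a variation on the one-liner above, the following sketch logs the start offset and the matched text of every replacement, using `$-[0]` (the start of the current match) inside the `/e` replacement. The helper name `trace_subst` is hypothetical, invented for illustration; it works on a copy so the caller's string is untouched.

```perl
use strict;
use warnings;

# Hypothetical helper: run s/(b*)/^/g on a copy of the input and record
# a [start offset, matched text] pair for every replacement performed.
sub trace_subst {
    my ($input) = @_;
    my @log;
    my $copy  = $input;
    my $count = ($copy =~ s/(b*)/push @log, [ $-[0], $1 ]; "^"/eg);
    return ($copy, $count, \@log);
}

for my $input ("abba", "abbbbc") {
    my ($result, $count, $log) = trace_subst($input);
    print "$input -> $result ($count replacements)\n";
    printf "  offset %d: matched '%s'\n", @$_ for @$log;
}
```

For "abbbbc" the log should show all four b's consumed by a single match at offset 1. The two adjacent carets in ^a^^c^ then come from that match plus an empty match at offset 5, just before the c, rather than from two pairs of b's.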
      If you execute it a second time, it starts off from where it left.
      If that were true, while (/(b*)/g) would never finish. It starts where it finished the previous time, unless it matched an empty string. In the latter case, pos() will be advanced by one. (Details are even more complicated. But documented. See the section "Repeated Patterns Matching a Zero-length Substring" in the perlre manual page.)
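One way to observe this behaviour, without claiming anything about the engine's internals, is to record where each iteration of the loop starts and ends. The sketch below uses a hypothetical helper, `match_spans`, that collects the start offset (`$-[0]`), the value of pos() after the match, and the captured text.

```perl
use strict;
use warnings;

# Hypothetical helper: for each iteration of a /g loop, record
# [start offset, pos() after the match, captured text].
sub match_spans {
    my ($text) = @_;
    my @spans;
    while ($text =~ /(b*)/g) {
        push @spans, [ $-[0], pos($text), $1 ];
    }
    return @spans;
}

for my $span (match_spans("abbbabc")) {
    my ($start, $end, $matched) = @$span;
    printf "start=%d end=%d length=%d\n", $start, $end, length $matched;
}
```

The loop terminates, and no two consecutive iterations report an empty match at the same offset: after an empty match, the next one is either non-empty or begins further along the string.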