b2sing4u has asked for the wisdom of the Perl Monks concerning the following question:

Look at the code below.

    $s1 = "abc";
    $s2 = "abc";
    $s1 =~ s/.*/t/;
    $s2 =~ s/.*/t/g;
    print $s1;
    print "\n";
    print $s2;

The output should be

    t
    t

But the actual output is

    t
    tt

Is this a bug?

Re: Incorrect pattern recognition problem.
by ikegami (Patriarch) on Oct 31, 2006 at 07:21 UTC
    That happens when the pattern can match 0 characters.
    • Pass 1: The regexp matches 3 characters starting at pos 0. They are replaced with "t". pos is now 3.
    • Pass 2: The regexp matches 0 characters starting at pos 3. They are replaced with "t". pos is now 3.
    • Pass 3: The regexp already matched 0 characters at pos 3, so Perl will not accept another zero-length match there. No non-empty match is possible at pos 3 or later, so the regexp fails to match.
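
    The passes above can be watched directly with a plain m//g loop. This is only a small sketch: the variable names are made up for illustration, and it uses the built-in @- and @+ arrays to report where each match starts and how long it is.

```perl
use strict;
use warnings;

my $s = "abc";
my @matches;

# In scalar context, each successful /g match resumes from pos($s).
while ($s =~ /.*/g) {
    # $-[0] and $+[0] are the start and end offsets of the last match.
    push @matches, sprintf("start=%d length=%d", $-[0], $+[0] - $-[0]);
}

# Prints:
#   start=0 length=3
#   start=3 length=0
print "$_\n" for @matches;
```

    The second line is the empty match at the end of the string, which is the one that produces the extra "t" under s///g.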

    Perhaps you wanted $s2 =~ s/.+/t/g;

    • Pass 1: The regexp matches 3 characters starting at pos 0. They are replaced with "t". pos is now 3.
    • Pass 2: The regexp fails to match at pos 3 or later.
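
    For a side-by-side comparison, here is a short sketch of the original pattern next to two variants that cannot match the empty string at the end. The anchored variant is my own addition for illustration, not something proposed in the thread; it works here only because ^ (without /m) matches at position 0 and nowhere else.

```perl
use strict;
use warnings;

my ($star, $plus, $anchored) = ("abc") x 3;

$star     =~ s/.*/t/g;     # also matches the empty string at the end
$plus     =~ s/.+/t/g;     # needs at least one character per match
$anchored =~ s/^.*/t/g;    # ^ cannot match again at position 3

# Prints: tt t t
print "$star $plus $anchored\n";
```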
Re: Incorrect pattern recognition problem.
by davido (Cardinal) on Oct 31, 2006 at 07:28 UTC

    To further support ikegami's correct explanation, have a look at the following adaptation on your code:

        use strict;
        use warnings;

        my $s1 = "abc";
        my $s2 = "abc";
        $s1 =~ s/.*/t/;
        $s2 =~ s/(?{print pos(), qq!\n!}).*/t/g;
        print "$s1\n$s2\n";

    And the output,

        0
        3
        3
        t
        tt

    On the first pass, the position pointer is at 0, and .* greedily matches 'abc'. On the next pass the pointer is at position 3 (the end of the string has been reached), and .* matches nothing, which is also legal. The third pass finds that the position pointer cannot be advanced further (still at 3), and fails immediately, ending the /g loop.


    Dave

Re: Incorrect pattern recognition problem.
by bobf (Monsignor) on Oct 31, 2006 at 07:25 UTC

    This appears to be a situation where the pattern matches a zero-length substring (see perlre, "Repeated Patterns Matching a Zero-length Substring"). Using s/.+/t/g gives the expected result.

    Update: Adding use re 'debug'; to the code produces some interesting output that illustrates the documented behavior. Note how it says Matching REx ".*" against "" <stuff deleted> Match successful! midway through. If I understand this correctly, that is where it matches the empty string and produces the extra 't' in the output.

        Compiling REx `.*'
        size 3 Got 28 bytes for offset annotations.
        first at 2
           1: STAR(3)
           2:   REG_ANY(0)
           3: END(0)
        anchored(MBOL) implicit minlen 0
        Offsets: [3]
                2[1] 1[1] 3[0]
        Matching REx ".*" against "abc"
          Setting an EVAL scope, savestack=5
           0 <> <abc>             |  1:  STAR
                                   REG_ANY can match 3 times out of 2147483647...
          Setting an EVAL scope, savestack=5
           3 <abc> <>             |  3:  END
        Match successful!
        Matching REx ".*" against ""
          Setting an EVAL scope, savestack=5
           3 <abc> <>             |  1:  STAR
                                   REG_ANY can match 0 times out of 2147483647...
          Setting an EVAL scope, savestack=5
           3 <abc> <>             |  3:  END
        Match successful!
        Matching REx ".*" against ""
          Setting an EVAL scope, savestack=5
           3 <abc> <>             |  1:  STAR
                                   REG_ANY can match 0 times out of 2147483647...
          Setting an EVAL scope, savestack=5
           3 <abc> <>             |  3:  END
        Match possible, but length=0 is smaller than requested=1, failing!
        failed...
        Match failed
        ttFreeing REx: `".*"'

    See perldebguts for more information on interpreting this output.

Re: Incorrect pattern recognition problem. (bug)
by tye (Sage) on Oct 31, 2006 at 15:18 UTC

    I consider it a bug, though a design bug. I explain how I would improve this in Re^3: zero-length match increments pos() (two!), but you'll probably have to read more of zero-length match increments pos() to understand what I'm talking about (which will also get you a good explanation of how Perl decides which surprising results to return).

    It boils down to the fact that Perl's regex engine has to refuse to return certain matches, or else it would cause infinite loops. But the least-surprise choice would skip a few more matches (only when zero-width matches are possible); that choice is what 'sed' and 'vi' both make, but it is not what Perl does.
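
    To make the difference concrete, here is a rough sketch of a substitution loop that makes the other choice. It rests on an assumption of mine: that the sed-style rule amounts to "ignore an empty match that begins exactly where the previous match ended". The helper name subst_sed_style is hypothetical; it is not a Perl built-in and is not taken from the linked posts.

```perl
use strict;
use warnings;

# Hypothetical helper: global substitution that ignores an empty match
# starting exactly where the previous accepted match ended.
sub subst_sed_style {
    my ($string, $re, $replacement) = @_;
    my $out      = '';
    my $last_end = 0;    # offset up to which $string has been copied
    my $prev_end;        # end of the previous accepted match, if any

    while ($string =~ /$re/g) {
        my ($start, $end) = ($-[0], $+[0]);

        # Assumed sed-style rule: skip an empty match that touches
        # the end of the previous match.
        next if $start == $end
            && defined $prev_end
            && $start == $prev_end;

        # Copy the unmatched text before this match, then the replacement.
        $out .= substr($string, $last_end, $start - $last_end) . $replacement;
        $last_end = $prev_end = $end;
    }

    # Append whatever follows the last accepted match.
    return $out . substr($string, $last_end);
}

# Prints "t", where Perl's own s/.*/t/g gives "tt".
print subst_sed_style("abc", qr/.*/, "t"), "\n";
```

    With this rule the empty match at position 3 is dropped because it sits right after the match that consumed "abc", so only one replacement is made.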

    - tye