in reply to Re^3: zero-length match increments pos() (two!)
in thread zero-length match increments pos()

If \w?? matches an empty string rather than "a" (because it prefers the shorter match, it being anti-greedy), then I don't expect it to go on to match "a" next; it already made its decision regarding "a" and should move on to the next decision point.

What is 'it' in this context? I'm assuming you mean the quantifier ??, in which case I'll say that A? is syntactic sugar for (?:A|), A?? is sugar for (?:|A), and in 5.10 we will have A?+, which will be the same as (?>A|). Note that all three are different, and each has a choice point before consuming the 'A' part of the pattern (the entry into the alternation).
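
A quick way to check these equivalences is to compare what each form matches when nothing follows it. This is only a sketch: it assumes Perl 5.10 or later (for the possessive ?+), uses a literal a in place of A, and first_match is a helper name made up for the illustration.

```perl
use strict;
use warnings;
use 5.010;    # possessive quantifiers such as ?+ need 5.10 or later

# Hypothetical helper: return the text of the first match, or undef.
sub first_match {
    my ($re, $str) = @_;
    return $str =~ $re ? $& : undef;
}

# Each row: label, the quantifier form, the alternation it is sugar for.
my @pairs = (
    [ 'a?',  qr/a?/,  qr/(?:a|)/ ],
    [ 'a??', qr/a??/, qr/(?:|a)/ ],
    [ 'a?+', qr/a?+/, qr/(?>a|)/ ],
);

for my $pair (@pairs) {
    my ($label, $sugar, $longhand) = @$pair;
    for my $input ('a', 'b') {
        printf "%-4s on '%s': sugar='%s' longhand='%s'\n",
            $label, $input,
            first_match($sugar, $input), first_match($longhand, $input);
    }
}
```

On 'a' the greedy and possessive forms should both report 'a' while the anti-greedy form reports the empty string; with nothing after the quantifier, the difference between a? and a?+ does not show up yet.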

I will admit that the optimiser should be clever enough to turn /A?B/ and /A??B/ into /A?+B/ when A does not match B. But I don't think it makes sense to talk about A?? not matching the empty string first when A can match B; after all, you don't want 'a' =~ /a??a/ to fail because the first a matched (if you wanted that you should have written a?a). Having subtleties like this does allow one to make some mistakes, but it also allows the user to hand-tailor how the match proceeds.
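
The three quantifiers only diverge once something follows them that could also match 'a'. A small sketch of that case, again assuming Perl 5.10 or later; the anchors \A and \z are my addition so that each pattern has to account for the whole string:

```perl
use strict;
use warnings;
use 5.010;    # for the possessive form

my %result;
for my $pattern ('a?a', 'a??a', 'a?+a') {
    # The target is the single character 'a'.
    $result{$pattern} = ('a' =~ /\A(?:$pattern)\z/) ? 'matches' : 'fails';
    print "$pattern: $result{$pattern}\n";
}
```

Both a?a and a??a succeed, because each can leave the 'a' to the second atom (the greedy form only after backtracking). The possessive a?+a fails, since it keeps the 'a' it took and never revisits that choice.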

---
$world=~s/war/peace/g

Re^5: zero-length match increments pos() (anti-greedy)
by tye (Sage) on Nov 09, 2006 at 03:48 UTC

    "It" is \w??, and this isn't about backtracking. \w?? needs to match nothing first, and then match one more character if required to by backtracking (and then match two characters if required to again, etc.).

    I'm talking about what matches are returned in the end, not what is matched "so far" while the regex is still trying to find the next complete match.

    Consider this example:

    "xyz" =~ /\w*?/g;

    What Perl will return can be easily illustrated by

    s/(\w*?)/($1)/g

    Which gives us 7 matches:

    ()(x)()(y)()(z)()

    Which is the maximum number of matches without overlap or duplicates. But it includes lots of pairs of matches starting at the same points and lots of pairs of matches ending at the same point. And I think including such pairs of matches is counter to DWIM, and looking at non-Perl regex implementations has reinforced my position on this.
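
    To make the shared start and end points visible, the offsets of each match can be printed from @- and @+. This is a minimal sketch using the same string and pattern as above:

    ```perl
    use strict;
    use warnings;

    my $str = "xyz";
    my @spans;
    while ($str =~ /\w*?/g) {
        # $-[0] is the start offset of the match, $+[0] its end offset.
        push @spans, sprintf("%d-%d", $-[0], $+[0]);
    }
    print "@spans\n";    # 0-0 0-1 1-1 1-2 2-2 2-3 3-3
    ```

    Matches 0-0 and 0-1 start at the same point, and 0-1 and 1-1 end at the same point; the pattern repeats at every character.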

    The anti-greedy \w*? should only match more than one character if forced to by backtracking.

    The equivalent vim demonstration shows much more reasonable behavior:

    :s/\v(.{-0,})/(\1)/g

    Gives

    ()x()y()z

    Not quite perfect, IMO, but much saner. I think the best answer is:

    ()x()y()z()

    Perl forces \w*? to match more than it really wants in order to avoid an infinite loop, by forcing backtracking even though the pattern didn't fail. Instead, it should avoid the infinite loop by starting to look for the next match at the next position (if the previous match was zero-width).

    You can implement begin(N) < begin(N+1) by taking the "previous match was zero-width" bit and changing its behavior. Currently, Perl sees that bit and backtracks if it finds another zero-width match. Instead, Perl should see that bit and just increment pos() before looking for the next match.
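
    The proposed rule can be approximated from user code, which makes it easier to see what it would return. The sketch below is an emulation in plain Perl, not a patch to the regex engine; mark_matches is a made-up name, and it relies on the documented behavior that assigning to pos() also resets the "matched with zero-length" flag.

    ```perl
    use strict;
    use warnings;

    # Hypothetical helper: wrap every match of $re in parentheses, but after
    # a zero-width match step pos() forward by one instead of letting the
    # engine retry at the same position.
    sub mark_matches {
        my ($str, $re) = @_;
        my $out  = '';
        my $last = 0;                      # end of the previous match
        pos($str) = 0;
        while ($str =~ /$re/g) {
            my ($start, $end) = ($-[0], $+[0]);
            $out .= substr($str, $last, $start - $last)
                  . '(' . substr($str, $start, $end - $start) . ')';
            $last = $end;
            if ($start == $end) {          # zero-width match
                last if $end >= length $str;
                pos($str) = $end + 1;      # also clears the zero-length flag
            }
        }
        return $out . substr($str, $last);
    }

    print mark_matches("xyz", qr/\w*?/), "\n";    # ()x()y()z()
    ```

    With this stepping rule, every match starts strictly after the start of the previous one, so \w*? never gets pushed into consuming a character.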

    - tye        

      Ok, thanks for the clarification. For the record, I'm _kinda_ looking into this and we will see where it gets.

      I'm going to see what I can do to forbid a zero-length match from ending where the previous non-zero-length match ended. I make no promises, though.

      ---
      $world=~s/war/peace/g