"It" is \w??, and this isn't about backtracking. \w?? needs to match nothing first, and then match one more character if required to by backtracking (and then match two characters if required to again, etc.).
I'm talking about what matches are returned in the end, not what is matched "so far" while the regex is still trying to find the next complete match.
Consider this example:
"xyz" =~ /\w*?/g;
What Perl will return can be easily illustrated by
s/(\w*?)/($1)/g
Which gives us 7 matches:
()(x)()(y)()(z)()
Which is the maximum number of matches without overlap or duplicates. But it includes lots of pairs of matches starting at the same points and lots of pairs of matches ending at the same point. And I think including such pairs of matches is counter to DWIM, and looking at non-Perl regex implementations has reinforced my position on this.
The anti-greedy \w*? should only match more than one character if forced to by backtracking.
The equivalent vim demonstration shows much more reasonable behavior:
:s/\v(.{-0,})/(\1)/g
Gives
()x()y()z
Note quite perfect, IMO, but much saner. I think the best answer is:
()x()y()z()
Perl forces \w*? to match more than it really wants in order avoid an infinite loop by forcing backtracking even though the pattern didn't fail. Instead, it should avoid the infinite loop by starting to look for the next match at the next position (if the previous match was zero-width).
Looking at my formulation of the rules and all of the possible matches that Perl would go through if forced to by backtracking:
begin(N) <= end(N) <= begin(N+1) <= end(N+1) begin(N) < begin(N+1) end(N) < end(N+1)
What I'd do:
A ()xyz # Return it b (x)yz # Fails: 0==begin(A) < begin(b)==0 c (xy)z # Fails: 0==begin(A) < begin(c)==0 d (xyz) # Fails: 0==begin(A) < begin(d)==0 E x()yz # Return it f x(y)z # Fails: 1==begin(E) < begin(f)==1 g x(yz) # Fails: 1==begin(E) < begin(g)==1 H xy()z # Return it i xy(z) # Fails: 2==begin(H) < begin(i)==2 J xyz() # Return it
What Perl does:
A ()xyz # Return it B (x)yz # Return it c (xy)z # Fails: 1==end(B) <= begin(c)==0 d (xyz) # Fails: 1==end(B) <= begin(d)==0 E x()yz # Return it F x(y)z # Return it g x(yz) # Fails: 2==end(F) <= begin(g)==1 H xy()z # Return it I xy(z) # Return it J xyz() # Return it
You can implement begin(N) < begin(N+1) by taking the "previous match was zero-width" and change its behavior. Currently, Perl see that bit and backtracks if it finds another zero-width match. Instead, Perl should see that bit and just increment pos() before looking for the next match.
Since this saner matching scheme means that no two matches ever start at the same point, it seems as if you could get rid of the hidden "last match was zero-width" bit. Doing so would be a good thing to do since there is no way to query/set that bit for check-pointing a piece-wise /.../g, something I've often done by just getting and setting pos().
But I haven't come up with a way to do that that works. It might be possible, but I've given up trying for now.
Then, to implement end(N) < end(N+1): If you find a zero-width match and you haven't yet incremented pos() since starting to look for this match, then you increment pos() and start over (you don't backtrack, because backtracking to avoid a "duplicate" match leads to non-DWIM behavior).
This fixes the original problem that was noticed, extra zero-width matches ending at the same point as the previous non-zero-width match. /.*/gs should match the whole string once, not the whole string and then also match an empty string at the end.
Finally, here is a tougher example for me to justify: Apply /x*?(?!x)/ to "xyz". Perl currently gives:
(x)()y()z()
While in vi we have:
:s/\vx{-0,}(x)@!/(\1)/g (x)y()z
But the best answer is
(x)y()z()
Because having the first two matches end at the same point should be avoided.
- tye
In reply to Re^5: zero-length match increments pos() (anti-greedy)
by tye
in thread zero-length match increments pos()
by Errto
| For: | Use: | ||
| & | & | ||
| < | < | ||
| > | > | ||
| [ | [ | ||
| ] | ] |