"It" is \w??, and this isn't about backtracking. \w?? needs to match nothing first, and then match one more character if required to by backtracking (and then match two characters if required to again, etc.).

I'm talking about what matches are returned in the end, not what is matched "so far" while the regex is still trying to find the next complete match.

Consider this example:

"xyz" =~ /\w*?/g;

What Perl will return can be easily illustrated by

s/(\w*?)/($1)/g

Which gives us 7 matches:

()(x)()(y)()(z)()

Which is the maximum number of matches without overlap or duplicates. But it includes lots of pairs of matches starting at the same points and lots of pairs of matches ending at the same point. And I think including such pairs of matches is counter to DWIM, and looking at non-Perl regex implementations has reinforced my position on this.

The anti-greedy \w*? should only match more than one character if forced to by backtracking.

The equivalent vim demonstration shows much more reasonable behavior:

:s/\v(.{-0,})/(\1)/g

Gives

()x()y()z

Note quite perfect, IMO, but much saner. I think the best answer is:

()x()y()z()

Perl forces \w*? to match more than it really wants in order avoid an infinite loop by forcing backtracking even though the pattern didn't fail. Instead, it should avoid the infinite loop by starting to look for the next match at the next position (if the previous match was zero-width).

Looking at my formulation of the rules and all of the possible matches that Perl would go through if forced to by backtracking:

begin(N) <= end(N) <= begin(N+1) <= end(N+1) begin(N) < begin(N+1) end(N) < end(N+1)

What I'd do:

A ()xyz # Return it b (x)yz # Fails: 0==begin(A) < begin(b)==0 c (xy)z # Fails: 0==begin(A) < begin(c)==0 d (xyz) # Fails: 0==begin(A) < begin(d)==0 E x()yz # Return it f x(y)z # Fails: 1==begin(E) < begin(f)==1 g x(yz) # Fails: 1==begin(E) < begin(g)==1 H xy()z # Return it i xy(z) # Fails: 2==begin(H) < begin(i)==2 J xyz() # Return it

What Perl does:

A ()xyz # Return it B (x)yz # Return it c (xy)z # Fails: 1==end(B) <= begin(c)==0 d (xyz) # Fails: 1==end(B) <= begin(d)==0 E x()yz # Return it F x(y)z # Return it g x(yz) # Fails: 2==end(F) <= begin(g)==1 H xy()z # Return it I xy(z) # Return it J xyz() # Return it

You can implement begin(N) < begin(N+1) by taking the "previous match was zero-width" and change its behavior. Currently, Perl see that bit and backtracks if it finds another zero-width match. Instead, Perl should see that bit and just increment pos() before looking for the next match.

Since this saner matching scheme means that no two matches ever start at the same point, it seems as if you could get rid of the hidden "last match was zero-width" bit. Doing so would be a good thing to do since there is no way to query/set that bit for check-pointing a piece-wise /.../g, something I've often done by just getting and setting pos().

But I haven't come up with a way to do that that works. It might be possible, but I've given up trying for now.

Then, to implement end(N) < end(N+1): If you find a zero-width match and you haven't yet incremented pos() since starting to look for this match, then you increment pos() and start over (you don't backtrack, because backtracking to avoid a "duplicate" match leads to non-DWIM behavior).

This fixes the original problem that was noticed, extra zero-width matches ending at the same point as the previous non-zero-width match. /.*/gs should match the whole string once, not the whole string and then also match an empty string at the end.

Finally, here is a tougher example for me to justify: Apply /x*?(?!x)/ to "xyz". Perl currently gives:

(x)()y()z()

While in vi we have:

:s/\vx{-0,}(x)@!/(\1)/g (x)y()z

But the best answer is

(x)y()z()

Because having the first two matches end at the same point should be avoided.

- tye        


In reply to Re^5: zero-length match increments pos() (anti-greedy) by tye
in thread zero-length match increments pos() by Errto

Title:
Use:  <p> text here (a paragraph) </p>
and:  <code> code here </code>
to format your post, it's "PerlMonks-approved HTML":



  • Posts are HTML formatted. Put <p> </p> tags around your paragraphs. Put <code> </code> tags around your code and data!
  • Titles consisting of a single word are discouraged, and in most cases are disallowed outright.
  • Read Where should I post X? if you're not absolutely sure you're posting in the right place.
  • Please read these before you post! —
  • Posts may use any of the Perl Monks Approved HTML tags:
    a, abbr, b, big, blockquote, br, caption, center, col, colgroup, dd, del, details, div, dl, dt, em, font, h1, h2, h3, h4, h5, h6, hr, i, ins, li, ol, p, pre, readmore, small, span, spoiler, strike, strong, sub, summary, sup, table, tbody, td, tfoot, th, thead, tr, tt, u, ul, wbr
  • You may need to use entities for some characters, as follows. (Exception: Within code tags, you can put the characters literally.)
            For:     Use:
    & &amp;
    < &lt;
    > &gt;
    [ &#91;
    ] &#93;
  • Link using PerlMonks shortcuts! What shortcuts can I use for linking?
  • See Writeup Formatting Tips and other pages linked from there for more info.