in reply to Re^2: zero-length match increments pos() (saner)
in thread zero-length match increments pos()

What it does is enforce that two zero length matches can't begin at the same point.

Ah. It took me a bit to figure out the crux of the difference you were pointing out so I highlighted it above. I had indeed missed that point and thank you for pointing it out.

But it does more than just prevent repeats. It also enforces that the next match must start after the end of the previous match. Zero-width matches make this rule insufficient. And I find that it makes more sense to extend this idea differently than Perl does.

If we follow the useful STL convention and define "end(N)" (the end of the Nth match) to be the character after the last character in the match (so that begin(N) <= end(N) and we don't have to try to talk about the spaces between characters), then the common-sense rule boils down to end(N) <= begin(N+1).

My proposal is that the above rule be extended as:

begin(N) <= end(N) <= begin(N+1) <= end(N+1)
begin(N) < begin(N+1)
end(N) < end(N+1)
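
To make those rules concrete, here is a minimal Perl sketch that checks a list of matches against them. It assumes each match is given as a [begin, end] pair of offsets, with end being the offset just past the last matched character; the helper name obeys_proposed_rules is made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: returns 1 if the list of [begin, end] spans obeys
# the proposed rules, 0 otherwise.
sub obeys_proposed_rules {
    my @m = @_;
    for my $i (0 .. $#m) {
        my ($beg, $end) = @{ $m[$i] };
        return 0 unless $beg <= $end;        # begin(N) <= end(N)
        next if $i == $#m;                   # no successor to compare with
        my ($nbeg, $nend) = @{ $m[$i + 1] };
        return 0 unless $end <= $nbeg;       # end(N)   <= begin(N+1)
        return 0 unless $beg <  $nbeg;       # begin(N) <  begin(N+1)
        return 0 unless $end <  $nend;       # end(N)   <  end(N+1)
    }
    return 1;
}

# Empty matches at offsets 0, 1, 2, 3: every rule holds.
print obeys_proposed_rules([0,0], [1,1], [2,2], [3,3]) ? "ok\n" : "violates\n";

# An empty match followed by a one-character match at the same offset:
# begin(N) < begin(N+1) is broken.
print obeys_proposed_rules([0,0], [0,1]) ? "ok\n" : "violates\n";
```

The first call prints "ok" and the second prints "violates".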

What Perl5 does, by contrast, can't be expressed (that I can see) with such rules alone. Perhaps something like:

begin(N) <= end(N) <= begin(N+1) <= end(N+1)
skip N+1 if begin(N)==end(N)==begin(N+1)==end(N+1)

Which is nice in one regard because it provides the maximum number of matches possible while obeying the first rule and not allowing repeats. But I don't think it is the best choice (considering "least surprise", for example).

If \w?? matches an empty string rather than "a" (because it prefers the shorter match, it being anti-greedy), then I don't expect it to go on to match "a" next; it already made its decision regarding "a" and should move on to the next decision point. My expectation is that begin(N) < begin(N+1).
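
For reference, a small sketch of what Perl 5 actually returns for this case. It records each match of \w?? against "a" as a [begin, end] pair taken from @- and @+; the variable names are just for illustration.

```perl
use strict;
use warnings;

my $str = "a";
my @spans;

# Scalar-context //g walks through the successive matches one at a time.
while ($str =~ /\w??/g) {
    push @spans, [ $-[0], $+[0] ];   # start and end offsets of this match
}

print join(" ", map { "[$_->[0],$_->[1]]" } @spans), "\n";
# prints: [0,0] [0,1] [1,1]
```

The second match, [0,1], is the one that begins at the same offset as the empty match before it.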

- tye        

Re^4: zero-length match increments pos() (two!)
by demerphq (Chancellor) on Nov 09, 2006 at 00:24 UTC

    If \w?? matches an empty string rather than "a" (because it prefers the shorter match, it being anti-greedy), then I don't expect it to go on to match "a" next; it already made its decision regarding "a" and should move on to the next decision point.

    What is 'it' in this context? I'm assuming you mean the quantifier ??, in which case I'll say that A? is syntactic sugar for (?:A|) and A?? is for (?:|A), and in 5.10 we will have A?+, which will be the same as (?>A|). Note that all three are different, and all have a choice point before consuming the 'A' part of the pattern (the entry into the alternation).

    I will admit that the optimiser should be clever enough to turn /A?B/ and /A??B/ into /A?+B/ when A does not match B. But I don't think it makes sense to talk about A?? not matching the empty string first when A can match B; after all, you don't want 'a'=~/a??a/ to fail because the first a matched (if you wanted that you should have written a?a). Having subtleties like this does allow one to make some mistakes, but it also allows the user to hand-tailor how the match proceeds.
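
    A quick way to see the difference between the three forms is to run each of them, followed by a literal a, against the one-character string 'a'. This is only a sketch, and it assumes Perl 5.10 or later for the possessive A?+ form; the %matched hash is just an illustrative name.

    ```perl
    use strict;
    use warnings;

    # Each quantifier form next to its alternation spelling.
    my @patterns = ('a?a', 'a??a', '(?:a|)a', '(?:|a)a', 'a?+a', '(?>a|)a');

    my %matched;
    for my $pat (@patterns) {
        $matched{$pat} = ('a' =~ /$pat/) ? 1 : 0;
        printf "%-10s %s\n", $pat, $matched{$pat} ? "matches" : "fails";
    }
    ```

    The greedy and non-greedy forms both match, because the engine can back up to the choice point and take the other branch. The possessive and atomic forms fail, because once the optional part has consumed the only 'a' it cannot give it back.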

    ---
    $world=~s/war/peace/g

      "It" is \w??, and this isn't about backtracking. \w?? needs to match nothing first, and then match one more character if required to by backtracking (and then match two characters if required to again, etc.).

      I'm talking about what matches are returned in the end, not what is matched "so far" while the regex is still trying to find the next complete match.

      Consider this example:

      "xyz" =~ /\w*?/g;

      What Perl will return can be easily illustrated by

      s/(\w*?)/($1)/g

      Which gives us 7 matches:

      ()(x)()(y)()(z)()

      Which is the maximum number of matches without overlap or duplicates. But it includes lots of pairs of matches starting at the same points and lots of pairs of matches ending at the same point. And I think including such pairs of matches is counter to DWIM, and looking at non-Perl regex implementations has reinforced my position on this.
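
      To put numbers on that, here is a small sketch that collects the [begin, end] offsets of each match from @- and @+ and counts how many adjacent pairs share a begin or an end. The variable names are only for illustration.

      ```perl
      use strict;
      use warnings;

      my $str = "xyz";
      my @spans;
      while ($str =~ /\w*?/g) {
          push @spans, [ $-[0], $+[0] ];   # offsets of the current match
      }

      # Count adjacent matches that begin, or end, at the same offset.
      my ($same_begin, $same_end) = (0, 0);
      for my $i (1 .. $#spans) {
          $same_begin++ if $spans[$i][0] == $spans[$i - 1][0];
          $same_end++   if $spans[$i][1] == $spans[$i - 1][1];
      }

      printf "%d matches, %d adjacent pairs share a begin, %d share an end\n",
          scalar(@spans), $same_begin, $same_end;
      # prints: 7 matches, 3 adjacent pairs share a begin, 3 share an end
      ```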

      The anti-greedy \w*? should only match more than one character if forced to by backtracking.

      The equivalent vim demonstration shows much more reasonable behavior:

      :s/\v(.{-0,})/(\1)/g

      Gives

      ()x()y()z

      Not quite perfect, IMO, but much saner. I think the best answer is:

      ()x()y()z()

      Perl forces \w*? to match more than it really wants in order to avoid an infinite loop, by forcing backtracking even though the pattern didn't fail. Instead, it should avoid the infinite loop by starting to look for the next match at the next position (if the previous match was zero-width).

      You can implement begin(N) < begin(N+1) by taking the "previous match was zero-width" bit and changing its behavior. Currently, Perl sees that bit and backtracks if it finds another zero-width match. Instead, Perl should see that bit and just increment pos() before looking for the next match.
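
      The idea can be emulated from user code, which may make it easier to play with. The following is only a sketch, not a patch to the regex engine: it sets pos() by hand before every match attempt (assigning to pos() also clears the zero-width flag) and steps one position forward after an empty match. It enforces only begin(N) < begin(N+1), and the names matches_advancing and mark_spans are made up for illustration.

      ```perl
      use strict;
      use warnings;

      # Return [begin, end] spans, bumping the start position by one after
      # every zero-width match instead of letting Perl backtrack.
      sub matches_advancing {
          my ($str, $re) = @_;
          my @spans;
          my $start = 0;
          while ($start <= length($str)) {
              pos($str) = $start;              # also resets the zero-width flag
              last unless $str =~ /$re/g;
              my ($beg, $end) = ($-[0], $+[0]);
              push @spans, [ $beg, $end ];
              $start = $end > $beg ? $end : $end + 1;
          }
          return @spans;
      }

      # Wrap each span in parentheses, like s/(...)/($1)/g would.
      sub mark_spans {
          my ($str, @spans) = @_;
          my ($out, $last) = ('', 0);
          for my $s (@spans) {
              my ($beg, $end) = @$s;
              $out .= substr($str, $last, $beg - $last);            # unmatched text
              $out .= '(' . substr($str, $beg, $end - $beg) . ')';  # the match
              $last = $end;
          }
          return $out . substr($str, $last);
      }

      print mark_spans("xyz", matches_advancing("xyz", qr/\w*?/)), "\n";
      # prints: ()x()y()z()
      ```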

      - tye        

        Ok, thanks for the clarification. For the record, I'm _kinda_ looking into this and we will see where it gets.

        I'm going to see what I can do to forbid a zero-length match from ending where the previous non-zero-length match ended. I make no promises, though.

        ---
        $world=~s/war/peace/g