in reply to Re: zero-length match increments pos() (saner)
in thread zero-length match increments pos()

Note that it would be even better if the regex engine, instead of just enforcing that two matches can't start at the same offset, would also enforce that two matches can't end at the same offset. For example:
But it doesn't enforce that at all. What it does is enforce that two zero length matches can't begin at the same point. All it's doing is keeping matches from being repeated. It is consistent with regard to behavior at each end of the string.

For example:

$ echo abcd | perl -nle 'print "=$1=" while /(^|\w?)/g;'
==
=a=
=b=
=c=
=d=
==
See how the first two matches begin at the same point, just as the last two end at the same point. Or, for more examples of zero-length matches, try this:
$ echo abcd | perl -nle 'print "=$1=" while /(\w??)/g;'
==
=a=
==
=b=
==
=c=
==
=d=
==
Note that some implementations of regular expressions which claim to be "perl compatible" (I'm looking at you, java.util.regex) are less smart than perl in this respect. Instead, they do what you accused perl of and prevent any two matches from beginning at the same point. You're right that this is inconsistent with having zero-length matches near the end of the string.
-- @/=map{[/./g]}qw/.h_nJ Xapou cets krht ele_ r_ra/; map{y/X_/\n /;print}map{pop@$_}@/for@/

Re^3: zero-length match increments pos() (two!)
by tye (Sage) on Feb 23, 2005 at 18:17 UTC
    What it does is enforce that two zero length matches can't begin at the same point.

    Ah. It took me a bit to figure out the crux of the difference you were pointing out so I highlighted it above. I had indeed missed that point and thank you for pointing it out.

    But it does more than just prevent repeats. It also enforces that the next match must start after the end of the previous match. Zero-width matches make this rule insufficient. And I find that it makes more sense to extend this idea differently than Perl does.

    If we follow the useful STL convention and define "end(N)" (the end of the Nth match) to be the character after the last character in the match (so that begin(N) <= end(N) and we don't have to try to talk about the spaces between characters), then the common-sense rule boils down to end(N) <= begin(N+1).

    My proposal is that the above rule be extended as:

    begin(N) <= end(N) <= begin(N+1) <= end(N+1)
    begin(N) < begin(N+1)
    end(N) < end(N+1)

    What Perl5 does, on the other hand, can't be expressed (that I can see) with such rules. Perhaps something like:

    begin(N) <= end(N) <= begin(N+1) <= end(N+1)
    skip N+1 if begin(N)==end(N)==begin(N+1)==end(N+1)

    Which is nice in one regard because it provides the maximum number of matches possible while obeying the first rule and not allowing repeats. But I don't think it is the best choice (considering "least surprise", for example).

    If \w?? matches an empty string rather than "a" (because it prefers the shorter match, it being anti-greedy), then I don't expect it to go on to match "a" next; it already made its decision regarding "a" and should move on to the next decision point. My expectation is that begin(N) < begin(N+1).

    - tye        

      If \w?? matches an empty string rather than "a" (because it prefers the shorter match, it being anti-greedy), then I don't expect it to go on to match "a" next; it already made its decision regarding "a" and should move on to the next decision point.

      What is 'it' in this context? I'm assuming you mean the quantifier ??, in which case I'll say that A? is syntactic sugar for (?:A|) and A?? is for (?:|A), and in 5.10 we will have A?+, which will be the same as (?>A|). Note all three are different, and have a choice point before consuming the 'A' part of the pattern (the entry into the alternation).

      I will admit that the optimiser should be clever enough to turn /A?B/ and /A??B/ into /A?+B/ when A does not match B. But I don't think it makes sense to talk about A?? not matching the empty string first when A can match B; after all, you don't want 'a'=~/a??a/ to fail because the first a matched (if you wanted that you should have written a?+a). Having subtleties like this does allow one to make some mistakes, but it also allows the user to hand-tailor how the match proceeds.

      ---
      $world=~s/war/peace/g

        "It" is \w??, and this isn't about backtracking. \w?? needs to match nothing first, and then match one more character if required to by backtracking (and then match two characters if required to again, etc.).

        I'm talking about what matches are returned in the end, not what is matched "so far" while the regex is still trying to find the next complete match.

        Consider this example:

        "xyz" =~ /\w*?/g;

        What Perl will return can be easily illustrated by

        s/(\w*?)/($1)/g

        Which gives us 7 matches:

        ()(x)()(y)()(z)()

        Which is the maximum number of matches without overlap or duplicates. But it includes lots of pairs of matches starting at the same point and lots of pairs of matches ending at the same point. And I think including such pairs of matches is counter to DWIM, and looking at non-Perl regex implementations has reinforced my position on this.

        The anti-greedy \w*? should only match more than one character if forced to by backtracking.

        The equivalent vim demonstration shows much more reasonable behavior:

        :s/\v(.{-0,})/(\1)/g

        Gives

        ()x()y()z

        Not quite perfect, IMO, but much saner. I think the best answer is:

        ()x()y()z()

        Perl forces \w*? to match more than it really wants in order to avoid an infinite loop, by forcing backtracking even though the pattern didn't fail. Instead, it should avoid the infinite loop by starting to look for the next match at the next position (if the previous match was zero-width).

        You can implement begin(N) < begin(N+1) by taking the "previous match was zero-width" bit and changing its behavior. Currently, Perl sees that bit and backtracks if it finds another zero-width match. Instead, Perl should see that bit and just increment pos() before looking for the next match.

        - tye