in reply to Re^2: zero-length match increments pos() (saner)
in thread zero-length match increments pos()
What it does is enforce that two zero length matches can't begin at the same point.
Ah. It took me a bit to figure out the crux of the difference you were pointing out so I highlighted it above. I had indeed missed that point and thank you for pointing it out.
But it does more than just prevent repeats. It also enforces that the next match must start after the end of the previous match. Zero-width matches make this rule insufficient. And I find that it makes more sense to extend this idea differently than Perl does.
If we follow the useful STL convention and define "end(N)" (the end of the Nth match) to be the character after the last character in the match (so that begin(N) <= end(N) and we don't have to try to talk about the spaces between characters), then the common-sense rule boils down to end(N) <= begin(N+1).
My proposal is that the above rule be extended as:
    begin(N) <= end(N) <= begin(N+1) <= end(N+1)
    begin(N) < begin(N+1)
    end(N) < end(N+1)
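To make the proposed rules concrete, here is a minimal sketch of a checker over a list of [begin, end] pairs. The name obeys_proposed_rules and the array-ref representation are assumptions made for illustration; nothing like this is built into Perl.

```perl
use strict;
use warnings;

# Hypothetical checker (not part of Perl): returns 1 if a list of
# [begin, end] spans obeys the proposed rules, 0 otherwise.
# "end" is the offset just past the last character of the match.
sub obeys_proposed_rules {
    my @spans = @_;
    for my $n (0 .. $#spans) {
        my ($begin, $end) = @{ $spans[$n] };
        return 0 if $begin > $end;               # begin(N) <= end(N)
        next if $n == $#spans;                   # no successor to compare with
        my ($next_begin, $next_end) = @{ $spans[$n + 1] };
        return 0 unless $end   <= $next_begin;   # end(N)   <= begin(N+1)
        return 0 unless $begin <  $next_begin;   # begin(N) <  begin(N+1)
        return 0 unless $end   <  $next_end;     # end(N)   <  end(N+1)
    }
    return 1;
}

# An empty match at 0 followed by "a" at [0,1] breaks begin(N) < begin(N+1).
print obeys_proposed_rules([0, 0], [0, 1]) ? "ok" : "violates", "\n";
# Three empty matches at successive offsets satisfy all of the rules.
print obeys_proposed_rules([0, 0], [1, 1], [2, 2]) ? "ok" : "violates", "\n";
```

Note that a non-empty match followed by an empty match at its end, such as [0,1] then [1,1], is also rejected, because of end(N) < end(N+1).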
What Perl5 does, on the other hand, can't be expressed (that I can see) with such simple rules. Perhaps something like:
    begin(N) <= end(N) <= begin(N+1) <= end(N+1)
    skip N+1 if begin(N)==end(N)==begin(N+1)==end(N+1)
Which is nice in one regard because it provides the maximum number of matches possible while obeying the first rule and not allowing repeats. But I don't think it is the best choice (considering "least surprise", for example).
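Perl5's behavior can be observed directly by recording begin and end for each successive match with @- and @+ (where $+[0] is already the offset just past the match, so it fits the end(N) convention used here). This is only a sketch; the helper name match_spans and the test string "ab" are chosen for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: collect [begin, end] for each successive //g match.
# $-[0] is the offset where the match starts, $+[0] the offset just past it.
sub match_spans {
    my ($re, $str) = @_;
    my @spans;
    while ($str =~ /$re/g) {
        push @spans, [ $-[0], $+[0] ];
    }
    return @spans;
}

my @perl5_spans = match_spans(qr/\w??/, "ab");
print join(" ", map { "[$_->[0],$_->[1]]" } @perl5_spans), "\n";
```

As far as I know this prints [0,0] [0,1] [1,1] [1,2] [2,2] on Perl 5: each empty match is followed by a non-empty match that begins at the same offset, which is the maximal set of matches that never repeats a zero-length match at one position.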
If \w?? matches an empty string rather than "a" (because it prefers the shorter match, it being anti-greedy), then I don't expect it to go on to match "a" next; it already made its decision regarding "a" and should move on to the next decision point. My expectation is that begin(N) < begin(N+1).
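The expectation begin(N) < begin(N+1) can be roughly emulated in today's Perl by pushing pos() forward one character after every zero-length match. This is a sketch under assumptions: the helper name spans_forcing_progress is made up, and it only enforces the begin rule after an empty match; it makes no attempt at the end(N) < end(N+1) rule in general.

```perl
use strict;
use warnings;

# Hypothetical helper: like a plain //g loop, but after a zero-length
# match it steps pos() one character forward, so the next match cannot
# begin at the same offset as the empty one.
sub spans_forcing_progress {
    my ($re, $str) = @_;
    my @spans;
    while ($str =~ /$re/g) {
        my ($begin, $end) = ($-[0], $+[0]);
        push @spans, [ $begin, $end ];
        if ($begin == $end) {
            last if $end >= length $str;   # at end of string, nothing to step over
            pos($str) = $end + 1;          # move on to the next decision point
        }
    }
    return @spans;
}

my @spans = spans_forcing_progress(qr/\w??/, "ab");
print join(" ", map { "[$_->[0],$_->[1]]" } @spans), "\n";
```

For "ab" this should give [0,0] [1,1] [2,2]: \w?? declines "a", then declines "b", and never goes back to pick up a character it already passed over.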
- tye