in reply to zero-length match increments pos()

Note that it would be even better if the regex engine, instead of just enforcing that two matches can't start at the same offset, would also enforce that two matches can't end at the same offset. For example:

$str= "bbbcccbbb";
while( $str =~ /b*/g ) {
    print "($`)<$&>($')\n"
}
__END__
()<bbb>(cccbbb)
(bbb)<>(cccbbb)     <-- This one
(bbbc)<>(ccbbb)
(bbbcc)<>(cbbb)
(bbbccc)<bbb>()
(bbbcccbbb)<>()     <-- This one

The two indicated matches should be skipped since they end at the same offset as the matches that precede them. Note that most non-Perl regex processors know that this makes sense. For example, sed doesn't make this mistake. In vi:

a
bbbcccbbb
.
s/(b*)/x\1/g

produces

xbbbcxcxcxbbb
   /\       /\   no "x"s there

as is sensible.

A minor surprise, but something worth fixing when not shackled by backward compatibility (i.e., this deserves to be fixed in Perl6).
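The "no two matches end at the same offset" idea can be approximated in Perl 5 by filtering the matches after the fact. This is only a sketch: `saner_matches` is a hypothetical helper name, and it post-filters what the Perl 5 engine returns rather than changing how the engine searches.

```perl
use strict;
use warnings;

# Hypothetical helper: list the [begin, end) offsets of every /g match,
# dropping any match that ends where the previously kept match ended.
sub saner_matches {
    my ($str, $re) = @_;
    my @kept;
    my $prev_end = -1;
    while ($str =~ /$re/g) {
        my ($begin, $end) = ($-[0], $+[0]);
        next if $end == $prev_end;    # same end offset as the last kept match
        push @kept, [$begin, $end];
        $prev_end = $end;
    }
    return @kept;
}

my $str = "bbbcccbbb";
for my $m (saner_matches($str, qr/b*/)) {
    my ($from, $to) = @$m;
    printf "(%s)<%s>(%s)\n",
        substr($str, 0, $from),
        substr($str, $from, $to - $from),
        substr($str, $to);
}
# Prints:
# ()<bbb>(cccbbb)
# (bbbc)<>(ccbbb)
# (bbbcc)<>(cbbb)
# (bbbccc)<bbb>()
```

The two matches flagged above are the ones that disappear.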

- tye        

Re^2: zero-length match increments pos() (saner)
by hv (Prior) on Feb 22, 2005 at 12:29 UTC

    I expect there will be more control over this feature in perl6 - we've certainly discussed the need, specifically with reference to //gc-style matches, though I don't think Larry has settled on the mechanisms yet.

    I don't think I fully understand the vi interpretation from your description, but even if there is no explicit combination of overrides that would request it I'd expect it to be easy to add such a thing by subclassing the grammar grammar.

    Hugo

      The thought plickens...

      I wanted to add some more examples to make sure the point is clear and so needed a handy copy of sed and eventually turned to my Zaurus (since I was in bed) and produced:

      $ echo bbbaaabbb | sed -e 's/\(b*\)/(\1)/g'
      (bbb)()a()a()a(bbb)
      $

      which added a new point on the speculum (ducks [1]).

      I eventually calmed down and convinced myself it was just a quirk of busybox's imitation of sed and found a real copy of sed on FreeBSD and produced:

      $ echo bbbaaabbb | sed -e 's/\(b*\)/(\1)/g'
      (bbb)a()a()a(bbb)
      $

      to compare to Perl:

      $ echo bbbaaabbb | perl -pe 's/(b*)/(\1)/g'
      (bbb)()a()a()a(bbb)()
      ()
      $ echo bbbaaabbb | perl -lpe 's/(b*)/(\1)/g'
      > (bbb)()a()a()a(bbb)()
      $

      So we see that the ancient lords of s///g, sed and vi(ex), agree that it doesn't make sense for two successive matches to end at the same point.

      We also see how easy it is to overlook this point. The authors of busybox (or the regex library it uses) realized that once you reach the end, you are done, but not that it doesn't make sense for two matches to end at the same point other than at the end: (bbb)()a()a()a(bbb)

      So I'm sure Perl6 will need to support a Perl5-compatible mode, but it'd be nice if it also supported a sed / vi / saner mode (and, personally, I'd make that the default mode -- the Perl5 mode has even been accused of being a "bug" right here at PerlMonks more than once, and not only by me).
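For comparison, here is a rough Perl 5 sketch that reproduces the FreeBSD sed result by skipping an empty match that sits directly after the previous match. The helper name `sed_style_wrap` is made up for illustration, and the parenthesised replacement is hard-coded to mirror the one-liners above.

```perl
use strict;
use warnings;

# Hypothetical helper: wrap every match of $re in parentheses, but skip an
# empty match that starts exactly where the previous match ended.
sub sed_style_wrap {
    my ($str, $re) = @_;
    my $out       = '';
    my $copied_to = 0;     # everything before this offset is already in $out
    my $prev_end  = -1;
    while ($str =~ /$re/g) {
        my ($begin, $end) = ($-[0], $+[0]);
        next if $begin == $end && $end == $prev_end;
        $out .= substr($str, $copied_to, $begin - $copied_to);   # unmatched text
        $out .= '(' . substr($str, $begin, $end - $begin) . ')'; # wrapped match
        $copied_to = $prev_end = $end;
    }
    return $out . substr($str, $copied_to);
}

print sed_style_wrap("bbbaaabbb", qr/b*/), "\n";
# Prints: (bbb)a()a()a(bbb)
```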

      While thinking about this, I also envisioned a fun 'watch me backtrack' mode.

      - tye        

      [1] That's enough to make a Welsh Harlequin blush.

        Do we need a mode for this? Getting all the matches? It's possible to do with an embedded code block (as I think you know :-)

        perl -le"$_='bbaabb'; /b*(?{print '.' x $-[0],qq<($&)>})(*FAIL)/g"
        (bb)
        (b)
        ()
        .(b)
        .()
        ..()
        ...()
        ....(bb)
        ....(b)
        ....()
        .....(b)
        .....()
        ......()

        On perls earlier than mine you can spell (*FAIL) as (?!).

        ---
        $world=~s/war/peace/g

        I also envisioned a fun 'watch me backtrack' mode.

        These are precisely the matches that will be returned by another option, which I think was called ':exhaustive'.

        I expect that option also to be very useful for combinatorial exercises.

        Hugo

Re^2: zero-length match increments pos() (saner)
by nobull (Friar) on Feb 21, 2005 at 19:04 UTC
    I completely disagree. If a regex engine fails to find the zero width match at the end of a string then it is broken.

    Note the correct way to refer to capture variables in the RHS is $1 etc, not \1.

      It already found a match at the end of the string. It should not find two different matches at the end of the string.

      If I'd used $1 then I would have gotten

      x$1cx$1cx$1cx$1

      because it was vi, which isn't Perl (otherwise it would not have been a good choice for demonstrating how vi doesn't act like Perl).

      - tye        

Re^2: zero-length match increments pos() (saner)
by fizbin (Chaplain) on Feb 23, 2005 at 17:51 UTC
    Note that it would be even better if the regex engine, instead of just enforcing that two matches can't start at the same offset, would also enforce that two matches can't end at the same offset. For example:
    But it doesn't enforce that at all. What it does is enforce that two zero-length matches can't begin at the same point. All it's doing is keeping matches from being repeated. It is consistent with regard to behavior at each end of the string.

    For example:

    $ echo abcd | perl -nle 'print "=$1=" while /(^|\w?)/g;'
    ==
    =a=
    =b=
    =c=
    =d=
    ==
    See how the first two matches begin at the same point, just as the last two end at the same point. Or, for more examples of zero-length matches, try this:
    $ echo abcd | perl -nle 'print "=$1=" while /(\w??)/g;'
    ==
    =a=
    ==
    =b=
    ==
    =c=
    ==
    =d=
    ==
    Note that some implementations of regular expressions which claim to be "perl compatible" (I'm looking at you, java.util.regex) are less smart than perl in this respect. Instead, they do what you accused perl of and prevent any two matches from beginning at the same point. You're right that this is inconsistent with having zero-length matches near the end of the string.
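To make the offsets explicit, the same two patterns can be run with the begin and end positions printed from `@-` and `@+`. A small sketch, with `match_spans` being a hypothetical helper name:

```perl
use strict;
use warnings;

# Hypothetical helper: return "begin-end" offsets for every /g match.
sub match_spans {
    my ($str, $re) = @_;
    my @spans;
    while ($str =~ /$re/g) {
        push @spans, join('-', $-[0], $+[0]);
    }
    return @spans;
}

print join(' ', match_spans('abcd', qr/(^|\w?)/)), "\n";
# Prints: 0-0 0-1 1-2 2-3 3-4 4-4
print join(' ', match_spans('abcd', qr/(\w??)/)), "\n";
# Prints: 0-0 0-1 1-1 1-2 2-2 2-3 3-3 3-4 4-4
```

In the first line the matches 0-0 and 0-1 begin at the same offset, and 3-4 and 4-4 end at the same offset.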
    -- @/=map{[/./g]}qw/.h_nJ Xapou cets krht ele_ r_ra/; map{y/X_/\n /;print}map{pop@$_}@/for@/
      What it does is enforce that two zero length matches can't begin at the same point.

      Ah. It took me a bit to figure out the crux of the difference you were pointing out so I highlighted it above. I had indeed missed that point and thank you for pointing it out.

      But it does more than just prevent repeats. It also enforces that the next match must start after the end of the previous match. Zero-width matches make this rule insufficient. And I find that it makes more sense to extend this idea differently than Perl does.

      If we follow the useful STL convention and define "end(N)" (the end of the Nth match) to be the character after the last character in the match (so that begin(N) <= end(N) and we don't have to try to talk about the spaces between characters), then the common-sense rule boils down to end(N) <= begin(N+1).

      My proposal is that the above rule be extended as:

      begin(N) <= end(N) <= begin(N+1) <= end(N+1)
      begin(N) < begin(N+1)
      end(N) < end(N+1)

      What Perl5 does, by contrast, can't be expressed (that I can see) with rules that simple. Perhaps something like:

      begin(N) <= end(N) <= begin(N+1) <= end(N+1)
      skip N+1 if begin(N)==end(N)==begin(N+1)==end(N+1)

      Which is nice in one regard because it provides the maximum number of matches possible while obeying the first rule and not allowing repeats. But I don't think it is the best choice (considering "least surprise", for example).

      If \w?? matches an empty string rather than "a" (because it prefers the shorter match, it being anti-greedy), then I don't expect it to go on to match "a" next; it already made its decision regarding "a" and should move on to the next decision point. My expectation is that begin(N) < begin(N+1).
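The proposed rule can be tried out as a filter over Perl 5's own matches. This is a sketch under assumptions: `proposed_rule_matches` is a hypothetical helper, and it only discards matches that break begin(N) < begin(N+1) or end(N) < end(N+1) relative to the last kept match. An engine that enforced the rule natively might search differently.

```perl
use strict;
use warnings;

# Hypothetical helper: keep only the matches allowed by the proposed rule,
# comparing each candidate with the last match that was kept.
sub proposed_rule_matches {
    my ($str, $re) = @_;
    my @kept;
    while ($str =~ /$re/g) {
        my ($begin, $end) = ($-[0], $+[0]);
        if (@kept) {
            my ($prev_begin, $prev_end) = @{ $kept[-1] };
            next unless $begin > $prev_begin && $end > $prev_end;
        }
        push @kept, [$begin, $end];
    }
    # Return the matched substrings rather than the offsets.
    return map { substr($str, $_->[0], $_->[1] - $_->[0]) } @kept;
}

print join('|', proposed_rule_matches('bbbcccbbb', qr/b*/)), "\n";
# Prints: bbb|||bbb
print scalar(my @m = proposed_rule_matches('abcd', qr/\w??/)), "\n";
# Prints: 5   (five empty matches, one per offset; no letter is matched)
```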

      - tye        

        If \w?? matches an empty string rather than "a" (because it prefers the shorter match, it being anti-greedy), then I don't expect it to go on to match "a" next; it already made its decision regarding "a" and should move on to the next decision point.

        What is 'it' in this context? I'm assuming you mean the quantifier ??, in which case I'll say that A? is syntactic sugar for (?:A|) and A?? is for (?:|A), and in 5.10 we will have A?+ which will be the same as (?>A|). Note all three are different, and have a choice point before consuming the 'A' part of the pattern (the entry into the alternation).

        I will admit that the optimiser should be clever enough to turn /A?B/ and /A??B/ into /A?+B/ when A does not match B. But I don't think it makes sense to talk about A?? not matching the empty string first when A can match B; after all, you don't want 'a'=~/a??a/ to fail because the first a matched (if you wanted that you should have written a?+a). Having subtleties like this does allow one to make some mistakes, but it also allows the user to hand-tailor how the match proceeds.
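        A quick way to see the three forms side by side (a sketch only; `first_capture` is a hypothetical helper, and the possessive form needs Perl 5.10 or later):

```perl
use strict;
use warnings;

# Hypothetical helper: return $1 from the first match, or undef on failure.
sub first_capture {
    my ($str, $re) = @_;
    return $str =~ $re ? $1 : undef;
}

my @cases = (
    [ 'aa', qr/(a?)a/     ],   # greedy: tries "a" first
    [ 'aa', qr/((?:a|))a/ ],   # the same choice spelled as an alternation
    [ 'aa', qr/(a??)a/    ],   # non-greedy: tries the empty string first
    [ 'aa', qr/((?:|a))a/ ],   # the same choice spelled as an alternation
    [ 'a',  qr/(a??)a/    ],   # succeeds: the lone "a" is left for the final a
    [ 'a',  qr/(a?+)a/    ],   # possessive: keeps the "a", so the match fails
);

for my $case (@cases) {
    my ($str, $re) = @$case;
    my $got = first_capture($str, $re);
    printf "%-4s =~ %-22s %s\n", "'$str'", $re,
        defined $got ? "captured '$got'" : "no match";
}
```

        The greedy forms capture "a" from 'aa', the non-greedy forms capture the empty string, and only the possessive form fails on 'a'.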

        ---
        $world=~s/war/peace/g