Errto has asked for the wisdom of the Perl Monks concerning the following question:

In their replies to counting overlapping patterns, both Eimi Metamorphoumai and betterworld (Update: betterworld's example is list context so that's almost a separate question ;Update 2: actually no, it's the same issue) submitted code that looked essentially like:
#!/usr/bin/perl my $foo = "AAAA"; while ($foo =~ /(?=AA)/g) { print pos $foo; }
which prints out "012" I am confused as to why this does not cause an infinite loop that continuously prints "0." According to perlop, following a successful match the value of pos() is set to the position after the end of the matched string, which is where the next iteration of m//g picks up. But since in this case, the matched string is zero-length, why does pos() ever advance? I ask because this seems like a cool trick, but I hesitate to use it since I'm not clear on why it works.

Replies are listed 'Best First'.
Re: zero-length match increments pos()
by ysth (Canon) on Feb 21, 2005 at 04:24 UTC
    Basically, a zero-length match isn't allowed to occur at the same position twice, whether in successive scalar matches or a single list context match. In addition to pos() being saved for each scalar, there's a flag that says whether it was zero-length or not. This is documented in http://search.cpan.org/perldoc/perlre#Repeated_patterns_matching_zero-length_substring.

    The flag can be cleared if needed by setting pos: pos($foo) = pos($foo).

      I used to use the pos($foo) = pos($foo) trick until Klaus Weidner pointed out to me in a patch to Text::xSV the useful fact that using the obscurely documented /c modifier keeps the flag from being set in the first place. In addition to being cleaner this fixed a segfault on long strings due to an internal Perl bug.
        If //c does that, I would call it a bug. Can you show a short test case? In at least this simple case, the flag (called MINMATCH, i.e. a minimum length of 1 will be required on the next match) is set:
        $ perl -we'use Devel::Peek; ($x="abc")=~/.??/gc; Dump $x' SV = PVMG(0x10183f00) at 0x10167ca4 REFCNT = 1 FLAGS = (SMG,POK,pPOK) IV = 0 NV = 0 PV = 0x10142828 "abc"\0 CUR = 3 LEN = 4 MAGIC = 0x10147380 MG_VIRTUAL = &PL_vtbl_mglob MG_TYPE = PERL_MAGIC_regex_global(g) MG_FLAGS = 0x01 MINMATCH
        (Note that older perls have a bug where Devel::Peek doesn't correctly show the flag by name, but you can still see the bit in MG_FLAGS.)
Re: zero-length match increments pos()
by Zaxo (Archbishop) on Feb 21, 2005 at 02:03 UTC

    It's the global match that does the trick. Even for a zero-width pattern, global matching causes pos to be advanced to the next point in the string where a new match may be tried.

    After Compline,
    Zaxo

Re: zero-length match increments pos()
by Aristotle (Chancellor) on Feb 21, 2005 at 05:06 UTC

    I am confused as to why this does not cause an infinite loop that continuously prints "0."

    It used to, in old versions of Perl. But many people would write stuff like

    $_ = "aabbccddeebbff"; while( /b*/g ) { # ... }

    and then be surprised that this was an infinite loop. So as Zaxo says, logic was added to make sure that with /g, even matches with zero overall length still advance the pointer.

    This is documented, as per ysth's post.

    Makeshifts last the longest.

Re: zero-length match increments pos() (saner)
by tye (Sage) on Feb 21, 2005 at 17:34 UTC

    Note that it would be even better if the regex engine, instead of just enforcing that two matches can't start at the same offset, would also enforce that two matches can't end at the same offset. For example:

    $str= "bbbcccbbb" while( $str =~ /b*/g ) { print "($`)<$&>($')\n" } __END__ ()<bbb>(cccbbb) (bbb)<>(cccbbb) <-- This one (bbbc)<>(ccbbb) (bbbcc)<>(cbbb) (bbbccc)<bbb>() (bbbcccbbb)<>() <-- This one

    The two indicated matches should be skipped since they end at the same offset as the matches that preceed them. Note that most non-Perl regex processors know that this makes sense. For example, sed doesn't make this mistake. In vi:

    a bbbcccbbb . s/(b*)/x\1/g

    produces

    xbbbcxcxcxbbb /\ /\ no "x"s there

    as is sensible.

    A minor surprise but something worth fixing when not shackled by backward compatability (ie. this deserves to be fixed in Perl6).

    - tye        

      I expect there will be more control over this feature in perl6 - we've certainly discussed the need, specifically with reference to //gc-style matches, though I don't think Larry has settled on the mechanisms yet.

      I don't think I fully understand the vi interpretation from your description, but even if there is no explicit combination of overrides that would request it I'd expect it to be easy to add such a thing by subclassing the grammar grammar.

      Hugo

        The thought plickens...

        I wanted to add some more examples to make sure the point is clear and so needed a handy copy of sed and eventually turned to my Zaurus (since I was in bed) and produced:

        $ echo bbbaaabbb | sed -e 's/\(b*\)/(\1)/g' (bbb)()a()a()a(bbb) $

        which added a new point on the speculum (ducks1).

        I eventually calmed down and convinced myself it was just a quirk of busybox's imitation of sed and found a real copy of sed on FreeBSD and produced:

        $ echo bbbaaabbb | sed -e 's/\(b*\)/(\1)/g' (bbb)a()a()a(bbb) $

        to compare to Perl:

        $ echo bbbaaabbb | perl -pe 's/(b*)/(\1)/g' (bbb)()a()a()a(bbb)() () $ echo bbbaaabbb | perl -lpe 's/(b*)/(\1)/g' > (bbb)()a()a()a(bbb)() $

        So we see that the ancient lords of s///g, sed and vi(ex), agree that it doesn't make sense for two successive matches to end at the same point.

        We also see how easy it is to overlook this point. The authors of busybox (or the regex library it uses) realized that once you reach the end, you are done, but not that it doesn't make sense for two matches to end at the same point other than at the end: (bbb)()a()a()a(bbb)

        So I'm sure Perl6 will need to support Perl5-compatable mode, but it'd be nice if it'd also supported sed / vi / saner mode (and, personally, I'd make that the default mode -- the Perl5 mode has even been accused of being a "bug" right here at PerlMonks more than once, other than by me).

        While thinking about this, I also envisioned a fun 'watch me backtrack' mode.

        - tye        

        1 That's enough to make a Welsh Harlequin blush.

      I completely disagree. If a regex engine fails to find the zero width match at the end of a string then it is broken.

      Note the correct way to refer to capture variables in the RHS is $1 etc, not \1.

        It already found a match at the end of the string. It should not find two different matches at the end of the string.

        If I'd used $1 then I would have gotten

        x$1cx$1cx$1cx$1

        because it was vi, which isn't Perl (because otherwise it would have not been a good choice for demonstrating how it doesn't act like Perl).

        - tye        

      Note that it would be even better if the regex engine, instead of just enforcing that two matches can't start at the same offset, would also enforce that two matches can't end at the same offset. For example:
      But it doesn't enforce that at all. What it does is enforce that two zero length matches can't begin at the same point. All it's doing is keeping matches from being repeated. It is consistent with regards to behavior at each end of the string.

      For example:

      $ echo abcd | perl -nle 'print "=$1=" while /(^|\w?)/g;' == =a= =b= =c= =d= ==
      See how the first two matches begin at the same point, just as the last two end at the same point. Or, for more examples of zero-length matches, try this:
      $ echo abcd | perl -nle 'print "=$1=" while /(\w??)/g;' == =a= == =b= == =c= == =d= ==
      Note that some implementations of regular expressions which claim to be "perl compatible" (I'm looking at you, java.util.regex) are less smart than perl in this respect. Instead, they do what you accused perl of and prevent any two matches from beginning at the same point. You're right that this is inconsistent with having zero-length matches near the end of the string.
      -- @/=map{[/./g]}qw/.h_nJ Xapou cets krht ele_ r_ra/; map{y/X_/\n /;print}map{pop@$_}@/for@/
        What it does is enforce that two zero length matches can't begin at the same point.

        Ah. It took me a bit to figure out the crux of the difference you were pointing out so I highlighted it above. I had indeed missed that point and thank you for pointing it out.

        But it does more than just prevent repeats. It also enforces that the next match must start after the end of the previous match. Zero-width matches make this rule insufficient. And I find that it makes more sense to extend this idea differently than Perl does.

        If we follow the useful STL convention and define "end(N)" (the end of the Nth match) to be the character after the last character in the match (so that begin(N) <= end(N) and we don't have to try to talk about the spaces between characters), then the common-sense rule boils down to end(N) <= begin(N+1).

        My proposal is that the above rule be extended as:

        begin(N) <= end(N) <= begin(N+1) <= end(N+1) begin(N) < begin(N+1) end(N) < end(N+1)

        While what Perl5 does can't be expressed (that I can see) with such rules. Perhaps something like:

        begin(N) <= end(N) <= begin(N+1) <= end(N+1) skip N+1 if begin(N)==end(N)==begin(N+1)==end(N+1)

        Which is nice in one regard because it provides the maximum number of matches possible while obeying the first rule and not allowing repeats. But I don't think it is the best choice (considering "least surprise", for example).

        If \w?? matches an empty string rather than "a" (because it prefers the shorter match, it being anti-greedy), then I don't expect it to go on to match "a" next; it already made its decision regarding "a" and should move on to the next decision point. My expectation is that begin(N) < begin(N+1).

        - tye        

Re: zero-length match increments pos()
by halley (Prior) on Feb 21, 2005 at 15:30 UTC

    In the book, Mastering Regular Expressions, the author discusses this phenomenon. By the way, it is exactly for this reason (the advancing pos()) that split // ends up separating the string into one-character components.

    --
    [ e d @ h a l l e y . c c ]