in reply to ZERO_LENGTH match

I think you are misinterpreting the intent of the documentation: when it refers to 'ZERO_LENGTH' in the equivalence, it means a match that actually has zero length (i.e. one that matches the empty string), not a subpattern that merely could match zero length, such as /c?/.

The implication is therefore that, for example:

/( (?: c? )* )/x
is treated as equivalent to:
/( (?: c* | ) )/x
to break the loop.
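
A quick way to convince yourself that the loop really is broken is to run both forms and compare what the outer group captures. This is only a minimal sketch: captured() is a hypothetical helper, and the two test strings are arbitrary choices of mine.

```perl
use strict;
use warnings;

# Hypothetical helper: returns what the outer group captured,
# or undef if the pattern did not match at all.
sub captured {
    my ($re, $str) = @_;
    return $str =~ $re ? $1 : undef;
}

my $looping   = qr/( (?: c? )* )/x;    # inner group can match the empty string
my $rewritten = qr/( (?: c* | ) )/x;   # the rewritten form quoted above

for my $str ('cccd', 'd') {
    printf "%-5s looping=[%s] rewritten=[%s]\n",
        $str, captured($looping, $str), captured($rewritten, $str);
}
```

Both forms capture 'ccc' from 'cccd' and the empty string from 'd', and the first one terminates instead of spinning on empty matches. Agreement on two inputs is of course not a proof of equivalence.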

Similarly for a more complex zero-length expression such as a lookahead:

/( (?: a | (?=c) ) )*/x
is treated as equivalent to:
/( (?: a* | (?=c) ) )/x

Hope this helps,

Hugo

Re^2: ZERO_LENGTH match
by Anonymous Monk on Aug 01, 2005 at 17:01 UTC
    /( (?: a* | (?=c) ) )/x

    This doesn't make any sense - a* cannot fail so you'll
    never end up in the position to try the second
    alternative.

    /( (?: a | (?=c) ) )*/x

    This means: match as many 'a's as you can, and when
    that becomes impossible try the second alternative.
    If it matches (zero-width) you face an infinite
    loop that you want to break, and you do this by
    allowing only one such zero-width match to happen.

      Yes, I was thinking as I wrote my reply that it would make more sense to break it as:

      /( (?: a+ | (?=c) ) )/x

      But I didn't want to introduce unnecessary complications for the OP, and I wasn't entirely sure there was no deep reason I was missing as to why the docs show /a*/ rather than /a+/ for this.

      Hugo



      OK, the first paragraph is wrong: if you backtrack
      you _can_ use the second alternative, but the fact is that
      you'll try matching (?=c) at position 0, not at some
      position after the last 'a' in the string, which is the
      case in /( (?: a | (?=c) ) )*/x.
      Breaking the infinite loop only forces (?=c) to
      be tried once; it doesn't redefine the position
      at which this happens.

        Try this:

        #!/usr/bin/perl
        use re "eval";

        # (?{ CODE }) is a classical zero-width assertion
        # that always succeeds and it is used only for its
        # side-effects.
        my $non_zero_width = 'a(?{ print 1 })';
        my $zero_width     = '(?{ print 2 })';

        # These two should be equivalent according to perlre.
        my $re1 = qq/ (?: $non_zero_width | $zero_width )* /;
        my $re2 = qq/ (?: $non_zero_width )* | (?: $zero_width )? /;

        # But are they really?
        $_ = 'aaabbb';
        print "\n-----------------\n";
        /$re1/x;
        print "\n-----------------\n";
        /$re2/x;
        print "\n-----------------\n";
        The output is:
        -----------------
        1112
        -----------------
        111
        -----------------
        which proves my point.
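
        For comparison, here is a sketch that also tries a third form, assuming the rewrite was meant as a concatenation, (?: NON_ZERO_LENGTH )* (?: ZERO_LENGTH )?, rather than an alternation. That reading is an assumption on my part, and trace_of() and $trace are hypothetical names. The code blocks append to a string instead of printing, so the three traces can be compared directly.

        ```perl
        use strict;
        use warnings;

        our $trace;   # records which branch ran, in order

        # Hypothetical helper: run one match and return the recorded trace.
        sub trace_of {
            my ($re, $str) = @_;
            $trace = '';
            $str =~ $re;
            return $trace;
        }

        # Code blocks are written literally here, so 'use re "eval"' is not needed.
        my $starred = qr/ (?: a(?{ $trace .= 1 }) | (?{ $trace .= 2 }) )* /x;
        my $either  = qr/ (?: a(?{ $trace .= 1 }) )* | (?: (?{ $trace .= 2 }) )? /x;
        my $concat  = qr/ (?: a(?{ $trace .= 1 }) )*   (?: (?{ $trace .= 2 }) )? /x;

        print "$_\n" for map { trace_of($_, 'aaabbb') } $starred, $either, $concat;
        ```

        On 'aaabbb' the starred and concatenated forms both give 1112, while the alternation gives 111, so under that assumption the concatenation is the form that reproduces the side effects of the original loop.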