sh1tn has asked for the wisdom of the Perl Monks concerning the following question:

Fellow Monks,

From perlre:
The lower-level loops are interrupted (that is, the loop is broken) when Perl detects that a repeated expression matched a zero-length substring. Thus:
    m{ (?: NON_ZERO_LENGTH | ZERO_LENGTH )* }x;
is made equivalent to
    m{ (?: NON_ZERO_LENGTH )* | (?: ZERO_LENGTH )? }x;
But in fact:
    $_ = 'aaabbb';
    /((?: a | c? )*)/x;       # 'c' matches once with zero-length
    # which is not the same as
    /(( a )* | ( c? )?)/x;    # where 'c' does not match at all
My question - is this behaviour expected? Thank you for the help!

Update: code sample given below - Re^4: ZERO_LENGTH match


Replies are listed 'Best First'.
Re: ZERO_LENGTH match
by hv (Prior) on Aug 01, 2005 at 16:15 UTC

    I think you are misinterpreting the intent of the documentation: when it refers to 'ZERO_LENGTH' in the equivalence, it is talking about an actual zero-length match (i.e. matching an empty string) rather than a potential zero-length match, such as /c?/.

    The implication is therefore that, for example:

    /( (?: c? )* )/x
    is treated as equivalent to:
    /( (?: c* | ) )/x
    to break the loop.
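
    A quick way to check this, as a minimal sketch: the helper name captures() and the test strings are made up for illustration and are not part of the original post. It only compares what group 1 captures for the two forms and shows that the starred form terminates.

```perl
use strict;
use warnings;

# Hypothetical helper: return what group 1 captures for each form,
# so the starred pattern and its loop-free rewrite can be compared.
sub captures {
    my ($text) = @_;
    my ($loop)      = $text =~ /( (?: c? )* )/x;   # starred group that may match empty
    my ($rewritten) = $text =~ /( (?: c* | ) )/x;  # loop-free form from the reply
    return ($loop, $rewritten);
}

# Illustrative inputs (assumed, not from the thread except 'aaabbb').
for my $text ('cccx', 'xyz', 'aaabbb') {
    my ($loop, $rewritten) = captures($text);
    printf "%-8s loop='%s' rewritten='%s'\n", $text, $loop, $rewritten;
}
```

    Both patterns can always match the empty string, so the list-context matches never fail here.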

    Similarly for a more complex zero-length expression such as a lookahead:

    /( (?: a | (?=c) ) )*/x
    is treated as equivalent to:
    /( (?: a* | (?=c) ) )/x

    Hope this helps,

    Hugo

      /( (?: a* | (?=c) ) )/x

      This doesn't make any sense - a* cannot fail so you'll
      never end up in the position to try the second
      alternative.

      /( (?: a | (?=c) ) )*/x

      This means - match as many 'a'-s as you can and when
      this becomes impossible try the second alternative -
      if it matches (a zero-width match) you'll face an infinite
      loop that you want to break - and you do this by
      allowing only one such zero-width match to happen.

        Yes, I was thinking as I wrote my reply that it would make more sense to break it as:

        /( (?: a+ | (?=c) ) )/x

        But I didn't want to introduce unnecessary complications for the OP, and I wasn't entirely sure there was no deep reason I was missing as to why the docs show /a*/ rather than /a+/ for this.

        Hugo
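
        To see the a* versus a+ difference in practice, here is a small sketch. The extra capture group around the lookahead, the helper name try_variants() and the sample strings are my own additions for illustration; they only make it visible whether the second alternative was ever taken.

```perl
use strict;
use warnings;

# Hypothetical helper: for each variant, report the overall match (group 1)
# and whether the lookahead alternative (group 2) took part in the match.
sub try_variants {
    my ($text) = @_;
    my %result;
    if ($text =~ /( a* | ((?=c)) )/x) {
        $result{star} = { match => $1, lookahead_used => defined $2 ? 1 : 0 };
    }
    if ($text =~ /( a+ | ((?=c)) )/x) {
        $result{plus} = { match => $1, lookahead_used => defined $2 ? 1 : 0 };
    }
    return \%result;
}

# Illustrative inputs (assumed): one starting with 'c', one starting with 'a'.
for my $text ('cab', 'aac') {
    my $r = try_variants($text);
    for my $variant (sort keys %$r) {
        printf "%-4s %-5s match='%s' lookahead_used=%d\n",
            $text, $variant, $r->{$variant}{match}, $r->{$variant}{lookahead_used};
    }
}
```

        With a*, the first alternative succeeds with an empty match on 'cab', so the lookahead is never reached; with a+, the first alternative fails there and the lookahead gets its turn.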



        OK, my first paragraph was wrong - if you backtrack
        you _can_ use the second alternative, but the fact is that
        you'll try matching (?=c) at position 0, not at some
        position after the last 'a' in the string, which is the
        case in /( (?: a | (?=c) ) )*/x.
        Breaking the infinite loop only forces (?=c) to
        be tried once; it doesn't redefine the position
        at which this happens.
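
        One way to inspect where the zero-width alternative is applied, sketched under some assumptions: I wrapped the lookahead in its own capture group so that @- can report its offset, and the helper name lookahead_offset() and the string 'aac' are invented for this example.

```perl
use strict;
use warnings;

# Hypothetical helper: return the overall match and the offset at which
# group 1 (the wrapped lookahead) matched; the offset is undef when
# group 1 did not take part in the match.
sub lookahead_offset {
    my ($text, $re) = @_;
    return unless $text =~ $re;
    my ($matched, $offset) = ($&, $-[1]);
    return ($matched, $offset);
}

my $loop_form      = qr/(?: a | ((?=c)) )*/x;   # zero-width alternative inside the loop
my $rewritten_form = qr/(?: a* | ((?=c)) )/x;   # loop-free form discussed above

for my $case ([ 'loop', $loop_form ], [ 'rewritten', $rewritten_form ]) {
    my ($label, $re) = @$case;
    my ($matched, $offset) = lookahead_offset('aac', $re);
    printf "%-10s matched='%s' lookahead offset=%s\n",
        $label, $matched, defined $offset ? $offset : 'not used';
}
```

        Both forms match the same text on this input; what can differ is whether, and at which offset, the lookahead group is recorded.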

Re: ZERO_LENGTH match
by Anonymous Monk on Aug 01, 2005 at 15:26 UTC
    What do you mean by the matches not being the same?
    $_ = 'aaabbb';
    while (/((?: a | c? )*)/xg) {
        printf "1: '%s' '%d'\n", $&, pos;
    }
    while (/(( a )* | ( c? )?)/xg) {
        printf "2: '%s' '%d'\n", $&, pos;
    }
    __END__
    1: 'aaa' '3'
    1: '' '3'
    1: '' '4'
    1: '' '5'
    1: '' '6'
    2: 'aaa' '3'
    2: '' '3'
    2: '' '4'
    2: '' '5'
    2: '' '6'
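
    The same comparison can be done programmatically rather than by eye. This is only a sketch; the helper name all_matches() is made up, and it simply collects the [matched text, pos] pairs that each /g loop above would print.

```perl
use strict;
use warnings;

# Hypothetical helper: collect every [matched text, pos] pair
# that a scalar-context /g loop yields for the given pattern.
sub all_matches {
    my ($text, $re) = @_;
    my @found;
    while ($text =~ /$re/g) {
        push @found, [ $&, pos($text) ];
    }
    return @found;
}

my @loop      = all_matches('aaabbb', qr/((?: a | c? )*)/x);
my @rewritten = all_matches('aaabbb', qr/(( a )* | ( c? )?)/x);

# Compare the two lists element by element.
my $same = @loop == @rewritten ? 1 : 0;
for my $i (0 .. $#loop) {
    last unless $same;
    $same = ($loop[$i][0] eq $rewritten[$i][0]
          && $loop[$i][1] == $rewritten[$i][1]) ? 1 : 0;
}
print $same ? "identical match lists\n" : "match lists differ\n";
```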
      Without 'g' modifier:
      1.>perl -e 'use re q(debug); $_ = q(aaabbb); /(( a | c? )*)/x'
      2.>perl -e 'use re q(debug); $_ = q(aaabbb); /(( a )* | ( c? )?)/x'