davido has asked for the wisdom of the Perl Monks concerning the following question:

Consider the following code:

local $_ = 'foo';
say 'Start' if /\G foo/gcx;
say 'Mid' if /\G .*/gcx;
say 'End' if /\G \z/gcx;

The output will be:

Start
Mid

Change the "Mid" case to this:

say 'Mid' if /\G .+/gcx;

And now the output will be:

Start
Mid
End

So all three conditions match. If you use any of the following quantifiers at the end of the 2nd expression, \z will not match in the third expression: *, ?, {0,}.

This is confirmed on Perl 5.26, and 5.10.

Similarly:

perl -E 'local $_ = "foo\n"; say "Start" if /\G foo/gcx; say "Mid" if /\G .*/gcx; say "End" if /\G (?=\n)/gcx'

So in this case we added a \n to the string and matched on .* for our "Mid" expression, then did a lookahead assertion for \n in the "End" expression. Since we are not using the /s modifier, .* should have stopped before \n, so (?=\n) should still find the newline (I think), and the "End" condition should be true.

I'm feeling like the difference between how .+ and .* consume the string (\z matching in the 3rd expression when the 2nd expression uses .+, but not when it uses .*) is an inconsistency that can't be defended as not being a bug, but I'm interested in what others' take on it might be.


Dave

Re: Seeking clarification on possible bug in regex using \G and /gc
by choroba (Cardinal) on Mar 14, 2018 at 23:05 UTC
    I'm not able to reproduce the behaviour you describe. Tried in 5.18.2 and 5.27.7, code:
    #! /usr/bin/perl
    use warnings;
    use strict;
    use feature qw{ say };

    {
        local $_ = 'foo';
        say '1 Start' if /\G foo/gcx;
        say '1 Mid' if /\G .*/gcx;
        say '1 End' if /\G \z/gcx;
    }

    {
        local $_ = 'foo';
        say '2 Start' if /\G foo/gcx;
        say '2 Mid' if /\G .+/gcx;
        say '2 End' if /\G \z/gcx;
    }

    Output:

    1 Start
    1 Mid
    2 Start
    2 End
    ($q=q:Sq=~/;[c](.)(.)/;chr(-||-|5+lengthSq)`"S|oS2"`map{chr |+ord }map{substrSq`S_+|`|}3E|-|`7**2-3:)=~y+S|`+$1,++print+eval$q,q,a,

      Thanks choroba for setting me straight. If there's one thing that I keep forgetting to learn it's not to post regex questions or answers in haste. ;)

      You are correct. This:

      local $_ = 'foo';
      m/\Gfoo/gc && say 'Matched foo';
      m/\G.+/gc && say 'Matched dot star';
      m/\G\z/gc && say 'Matched end of string';

      ...produces this:

      Matched foo
      Matched end of string

      And this:

      local $_ = 'foo';
      m/\Gfoo/gc && say 'Matched foo';
      m/\G.*/gc && say 'Matched dot star';
      m/\G\z/gc && say 'Matched end of string';

      ...produces this:

      Matched foo
      Matched dot star

      ...indicating that end of string was consumed by dot star, though it's still a little odd because this also matches:

      local $_ = 'foo';
      m/\Gfoo/gc && say 'Matched foo';
      m/\G.*\z/gc && say 'Matched dot star and end of string';

      ...produces this:

      Matched foo
      Matched dot star and end of string

      So while it may be alluded to in the documentation, it's not entirely unsurprising. :)


      Dave

      Breaks for me on perl 5.26.
      perl -E 'local $_ = "foo"; say "Start" if /\G foo/gcx; say "Mid" if /\G .*/gcx; say "End" if /\G \z/gcx'
      Start
      Mid
      Should instead print
      Start
      Mid
      End
      Here is a more extended item looking at pos as well.
      perl -E 'local $_ = "foo"; say "Start" if /\G foo/gcx; say pos; say "Mid" if /\G .*/gcx; say pos; say "End" if /\G \z/gcx; say "At End" if pos == length'
      Start
      3
      Mid
      3
      At End
      my @a=qw(random brilliant braindead); print $a[rand(@a)];
        > Should instead print

        Why? Using perl -Mre=debug will show you your mistake:

        ...
        Matching REx "\G \z" against ""
           3 <foo> <> | 1:GPOS(2)
           3 <foo> <> | 2:EOS(3)
           3 <foo> <> | 3:END(0)
        Match possible, but length=0 is smaller than requested=1, failing!
        Match failed

        The length=1 request comes from the previously mentioned Repeated Patterns Matching a Zero-length Substring:

        The higher-level loops preserve an additional state between iterations: whether the last match was zero-length. To break the loop, the following match after a zero-length match is prohibited to have a length of zero.
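
The rule quoted above can be checked with a short script. This is only a minimal sketch: the helper name run_sequence and the step labels are made up for illustration and are not part of any module. It applies each pattern in turn with /gc to one string and reports which steps matched, so the .* and .+ variants can be compared side by side.

```perl
use strict;
use warnings;
use feature 'say';

# Hypothetical helper: apply each pattern in order to the same string with
# /gc and return the labels of the steps that matched.
sub run_sequence {
    my ($string, @steps) = @_;
    my @matched;
    for my $step (@steps) {
        my ($label, $re) = @$step;
        # Scalar-context /gc match: pos($string) carries over between steps
        # and is left alone when a step fails.
        push @matched, $label if $string =~ /$re/gc;
    }
    return @matched;
}

# .* succeeds with an empty match at pos 3, so the following empty match
# (\z at the same position) is refused.
my @star = run_sequence('foo',
    [ Start => qr/\G foo/x ], [ Mid => qr/\G .*/x ], [ End => qr/\G \z/x ]);

# .+ has nothing left to consume and fails, so no zero-length match is
# recorded and \z is free to match.
my @plus = run_sequence('foo',
    [ Start => qr/\G foo/x ], [ Mid => qr/\G .+/x ], [ End => qr/\G \z/x ]);

say "star: @star";    # star: Start Mid
say "plus: @plus";    # plus: Start End
```

This agrees with the "1 Start / 1 Mid" and "2 Start / 2 End" output shown earlier in the thread.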

Re: Seeking clarification on possible bug in regex using \G and /gc
by tybalt89 (Monsignor) on Mar 14, 2018 at 23:10 UTC

    See "Repeated Patterns Matching a Zero-length Substring" in perlre.

      The closest mention I could see to this problem, and perhaps a clue, is:

      "The additional state of being matched with zero-length is associated with the matched string, and is reset by each assignment to pos()."

      So perhaps the issue is that the previous match succeeded but did not advance pos, so when the /\G \z/ hits, pos doesn't advance either and for some reason perl doesn't treat it as a successful match.

      I'm still with davido. I think this is a bit of a bug: it introduces failure at a distance in custom parsing engines. Luckily, for cases such as these, pos == length happens to be true.
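
If the perlre sentence about assignments to pos() is taken at face value, a parser that really wants the \G \z step to succeed after an empty match could assign pos() to itself first. The following is a sketch under that assumption; the helper name end_matches is hypothetical and exists only for this illustration.

```perl
use strict;
use warnings;
use feature 'say';

# Hypothetical helper: run the three-step sequence from the thread and
# report whether the final \G \z step matched. When $reset_pos is true,
# pos() is assigned its own value before the last step.
sub end_matches {
    my ($reset_pos) = @_;
    my $str = 'foo';

    $str =~ /\G foo/gcx or die 'unexpected: foo did not match';   # pos is now 3
    $str =~ /\G .*/gcx  or die 'unexpected: .* did not match';    # empty match at pos 3

    # Any assignment to pos(), even to the value it already has, clears the
    # "last match was zero-length" state described in perlre.
    pos($str) = pos($str) if $reset_pos;

    return $str =~ /\G \z/gcx ? 1 : 0;
}

say 'without reset: ', end_matches(0);    # without reset: 0
say 'with reset:    ', end_matches(1);    # with reset:    1
```

Whether resetting the state like this is a good idea in a real tokenizer is a separate question, since the state exists to stop loops that would otherwise match the empty string forever.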

        The opposite would be worse. If it didn't work like this, subtle differences would be introduced when you refactor some code, merging regexen or splitting them apart.