Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Hello. Consider the following program:
perl -e '"01234" =~ /^(.+)(.+)(.+)$(?{ print "$1 $2 $3\n" })(*FAIL)/'
It outputs all six possible splits:
012 3 4
01 23 4
01 2 34
0 123 4
0 12 34
0 1 234
Now if I change the third capture slightly (adding an optional "z", which should never match):
perl -e '"01234" =~ /^(.+)(.+)((?:.z?)+)$(?{ print "$1 $2 $3\n" })(*FAIL)/'
then the program outputs only five variants and "0 1 234" disappears. With more complex regexps a whole lot of variants are missing. Is it a bug? Tested with Perl 5.22.2.

Re: Bug with finding all regexp matches
by AnomalousMonk (Archbishop) on Oct 15, 2016 at 16:08 UTC
    perl -e '"01234" =~ /^(.+)(.+)((?:.z?)+)$(?{ print "$1 $2 $3\n" })(*FAIL)/'
    ... program outputs only five variants and "0 1 234" disappears.

    I see the same behavior on Strawberry Perl 5.10.1.5, 5.12.3.0 and 5.14.4.1, and on ActiveState 5.8.9. (Before 5.10 you have to use (?!) in place of (*FAIL), but the two patterns are exactly equivalent.)

    Why is this behavior seen? Dunno. Will have to think about this a bit.


    Give a man a fish:  <%-{-{-{-<

      Also, changing the first capture to non-greedy should not affect the number of results (the regexp is anchored on both sides), yet it does in the z? case (without z?, things work OK):
      perl -e '"01234" =~ /^(.+?)(.+)((?:.z?)+)$(?{ print "$1 $2 $3\n" })(*FAIL)/'
      0 123 4
      0 12 34
      0 1 234
      01 23 4
      It looks like a bug. It smells like a bug. It ate my whole project. But... Is it a bug?
Re: Bug with finding all regexp matches
by BrowserUk (Patriarch) on Oct 15, 2016 at 17:36 UTC
    Is it a bug?

    If you enable re 'debug', then you'll get to see what's going on in detail. The compiled regexes differ in much the way you'd expect:

    Final program:                Final program:
     1: BOL (2)                    1: BOL (2)
     2: OPEN1 (4)                  2: OPEN1 (4)
     4: PLUS (6)                   4: PLUS (6)
     5: REG_ANY (0)                5: REG_ANY (0)
     6: CLOSE1 (8)                 6: CLOSE1 (8)
     8: OPEN2 (10)                 8: OPEN2 (10)
    10: PLUS (12)                 10: PLUS (12)
    11: REG_ANY (0)               11: REG_ANY (0)
    12: CLOSE2 (14)               12: CLOSE2 (14)
    14: OPEN3 (16)                14: OPEN3 (16)
    16: PLUS (18)                 16: CURLYX[2] {1,32767} (24)
    17: REG_ANY (0)               18: REG_ANY (19)
                                  19: CURLY {0,1} (23)
                                  21: EXACT <z> (0)
                                  23: WHILEM[1/1] (0)
                                  24: NOTHING (25)
    18: CLOSE3 (20)               25: CLOSE3 (27)
    20: EOL (21)                  27: EOL (28)
    21: EVAL (23)                 28: EVAL (30)
    23: OPFAIL (24)               30: OPFAIL (31)
    24: END (0)                   31: END (0)

    Comparing the traces is harder.


    With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority". I knew I was on the right track :)
    In the absence of evidence, opinion is indistinguishable from prejudice.
      For
      perl -e 'use re "debug"; "01234" =~ /^(.+)(.+)((?:.z?)+)$(?{ print "$1 $2 $3\n" })(*FAIL)/' 2>&1 | less
      for the missing "0 1 234" the trace has
      whilem: (cache) already tried at this position... failed...
      and my wild guess is that it cached the effect of (*FAIL) rather than a real failure of the regexp, hence EVAL is skipped and the last result is missed. I didn't find a way to disable the cache with "use re".
        To add to the previous post: For
        perl -e 'use re "debug"; "01234" =~ /^(.+?)(.+)((?:.z?)+)$(?{ print "$1 $2 $3\n" })(*FAIL)/' 2>&1 | less
        two results are missing and there are two entries
        whilem: (cache) already tried at this position... failed...
        It figures.
Re: Bug with finding all regexp matches
by kroki (Novice) on Oct 16, 2016 at 09:08 UTC

    (had to register to see past Anonymous Monk's default depth limit)

    For the record: I'm convinced it's not a bug but a feature, and I asked for my bug report to be closed (it is now rejected).

    I also figured out that it's possible to disable caching by making (*FAIL) conditional:

    perl -e '"01234" =~ /^(.+?)(.+)((?:.z?)+)$(?(?{ print "$1 $2 $3\n" })(*FAIL)|(*ACCEPT))/'
    I won't depend on this myself though, because this useful feature may one day be "optimized and documented away" too :)
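
    For anyone who only needs the list of splits and would rather not depend on backtracking behaviour at all, a minimal sketch with plain loops follows. It assumes the goal is simply every way to cut a string into three non-empty pieces; the sub name three_way_splits is hypothetical and not part of the thread's code. The loops run longest-first so the order matches what the greedy (.+)(.+)(.+) pattern printed in the original question.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): return every way to cut $str
# into three non-empty pieces, as a list of [first, second, third] array refs.
sub three_way_splits {
    my ($str) = @_;
    my $len = length $str;
    my @splits;

    # $i is the length of the first piece, longest first (like a greedy .+)
    for (my $i = $len - 2; $i >= 1; $i--) {
        # $j is the length of the second piece, longest first;
        # the third piece takes whatever is left (at least one character)
        for (my $j = $len - $i - 1; $j >= 1; $j--) {
            push @splits, [
                substr($str, 0, $i),
                substr($str, $i, $j),
                substr($str, $i + $j),
            ];
        }
    }
    return @splits;
}

# Print one split per line, pieces separated by spaces
print "@$_\n" for three_way_splits('01234');
```

    For '01234' this lists the same six splits as the first program in the question. Any per-piece constraints (such as the optional "z") would then have to be checked separately on each piece.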

      (?(?{ print "$1 $2 $3\n" })(*FAIL)|(*ACCEPT))

      I don't understand the purpose of the (*ACCEPT) false clause in the quoted code fragment. Because the print will always return true (unless there's some terrible I/O failure :), this clause will never be executed. The following versions of the code (with and without (*ACCEPT)) give identical test results on all the 5.10+ Perl versions I have in captivity (see this):

      use 5.010;  # need (?(?{ code }) pattern)

      use Test::More 'no_plan';
      use Test::NoWarnings;

      note 'perl version ', $];

      for my $rw ('.+', '.+ w?') {
        for my $rx ('.+', '.+ x?') {
          for my $ry ('.+', '(?: . y?)+') {
            my $captures = qr{ ($rw) ($rx) ($ry) }xms;

            local our @ra;
            use re 'eval';
            '01234' =~ m{
                \A $captures \z
                # (?(?{ push @ra, [ $1, $2, $3 ] }) (*F) | (*ACCEPT))
                (?(?{ push @ra, [ $1, $2, $3 ] }) (*F))
                }xms;

            is_deeply \@ra,
                [ [ qw(012 3 4) ], [ qw(01 23 4) ], [ qw(01 2 34) ],
                  [ qw(0 123 4) ], [ qw(0 12 34) ], [ qw(0 1 234) ],
                  ],
                $captures;
            }  # end for $ry
          }  # end for $rx
        }  # end for $rw

      done_testing;
      (Of course, the push statement always returns true.) The true magick seems to reside in the use of the regex conditional expression.

      Output:

      c:\@Work\Perl\monks\kroki>perl permute_via_regex_1.pl
      # perl version 5.014004
      ok 1 - (?^msx: (.+) (.+) (.+) )
      ok 2 - (?^msx: (.+) (.+) ((?: . y?)+) )
      ok 3 - (?^msx: (.+) (.+ x?) (.+) )
      ok 4 - (?^msx: (.+) (.+ x?) ((?: . y?)+) )
      ok 5 - (?^msx: (.+ w?) (.+) (.+) )
      ok 6 - (?^msx: (.+ w?) (.+) ((?: . y?)+) )
      ok 7 - (?^msx: (.+ w?) (.+ x?) (.+) )
      ok 8 - (?^msx: (.+ w?) (.+ x?) ((?: . y?)+) )
      1..8
      ok 9 - no warnings
      1..9

      Of course, your final thought still holds true: none of this is guaranteed against future regex engine optimizations and other "improvements"!


      Give a man a fish:  <%-{-{-{-<