SuicideJunkie has asked for the wisdom of the Perl Monks concerning the following question:

It appears that the combination of repetition, alternation and captures can cause the captures to "leak".

Seen on 5.10 and 5.8:
use strict;
#use warnings; #printing lots of undefs

print "Enter your test strings:\n";
while (<main::DATA>) {
    chomp;
    print "\tTesting '$_':\n";

    /^(?:(?:(\d+)\s*c\s*)|(?:(\d+)\s*w\s*)|(?:(\d+)\s*r\s*))+/i;
    print "Capturing \\d+ only: 1='$1', 2='$2', 3='$3'\n";

    /^(?:(?:(\d+\s*c)\s*)|(?:(\d+\s*w)\s*)|(?:(\d+\s*r)\s*))+/i;
    print "Capturing \\d+ plus the letter: 1='$1', 2='$2', 3='$3'\n";
}
__DATA__
1c
2w
2c3w
1w1w
1w2r
2r1c

Note that the only difference in the regexes is the placement of the capture's closing bracket.

The above prints:
Enter your test strings:
        Testing '1c':
Capturing \d+ only: 1='1', 2='', 3=''
Capturing \d+ plus the letter: 1='1c', 2='', 3=''
        Testing '2w':
Capturing \d+ only: 1='', 2='2', 3=''
Capturing \d+ plus the letter: 1='', 2='2w', 3=''
        Testing '2c3w':
Capturing \d+ only: 1='3', 2='3', 3=''
Capturing \d+ plus the letter: 1='2c', 2='3w', 3=''
        Testing '1w1w':
Capturing \d+ only: 1='1', 2='1', 3=''
Capturing \d+ plus the letter: 1='', 2='1w', 3=''
        Testing '1w2r':
Capturing \d+ only: 1='2', 2='2', 3='2'
Capturing \d+ plus the letter: 1='', 2='1w', 3='2r'
        Testing '2r1c':
Capturing \d+ only: 1='1', 2='', 3='2'
Capturing \d+ plus the letter: 1='1c', 2='', 3='2r'

The second regex does what I expect. I can't fathom why the first regex would do what it does, however. I expected it would be the same as the second regex, minus the letters. Instead, after the first repetition, it seems that a match for $2 spills a copy into $1, and a match for $3 spills copies into $2 and $1... but only if the captures contain the same regex pattern (\d+ in this case).

Would I be correct to suspect that this is a bug or mis-optimization of some kind in perl?


Re: Leaking Regex Captures
by jwkrahn (Abbot) on Aug 04, 2009 at 15:00 UTC

    You have to understand that the numbered variables retain their values from the last successful match. '1c' is matched by the first capturing parentheses; '2w', '2c3w', and '1w1w' are captured by the second capturing parentheses; and '1w2r' and '2r1c' are captured by the third capturing parentheses. The values returned by the other capturing parentheses are therefore not valid and/or undefined. To get valid results, use only the contents of the capturing parentheses that actually matched:

    use strict;
    use warnings;

    print "Enter your test strings:\n";
    while ( <DATA> ) {
        chomp;
        print "\tTesting '$_':\n";

        /^(?:(?:(\d+)\s*c\s*)|(?:(\d+)\s*w\s*)|(?:(\d+)\s*r\s*))+/i
            and print "Capturing \\d+ only: '$+'\n";

        /^(?:(?:(\d+\s*c)\s*)|(?:(\d+\s*w)\s*)|(?:(\d+\s*r)\s*))+/i
            and print "Capturing \\d+ plus the letter: '$+'\n";
    }
    __DATA__
    1c
    2w
    2c3w
    1w1w
    1w2r
    2r1c

      I have been trying to understand how SuicideJunkie's code causes the results it does, and I am getting lost.
      jwkrahn - do you mean that $1 etc... are not being reset if they do not match? So as SuicideJunkie asked - how come they seem to inherit the value of the 'next' match? i.e. $2 = $3 (but only if $3 matches first...)?

      I also found that the \s* part of the regex is causing some of the problem (see regex 2 below; in isolation it works as expected),
      but then I moved around the order of the regexes and came back to the same problem, with the 'fixed' regex 2 now 'inheriting' the faulty results from regex 1.

      This behaviour really confuses me! And sorry to SuicideJunkie again for jumping on his node!!

      Just a something something...

      '2c3w' cannot be matched only by the second parentheses; the first parentheses must match as well, otherwise the entire match would fail. Given that both the first and the second must have matched successfully, if both $1 and $2 should "retain their values from the last successful match", then $1 should be 2, not 3.

      Since /^(?:(?:(\d+(?![rw]))\s*c\s*)|(?:(\d+(?![rc]))\s*w\s*)|(?:(\d+(?![cw]))\s*r\s*))+/i; also works as expected, it seems to me that the definition of "last successful match" might be changing between runs of a repetition.

      On the first pass, a successful match requires the whole alternation to match before it sets the capture variable, but on subsequent repeats, only the parentheses need to match before it changes $1?

      $_ = 'bb ca de';
      /(?:(.)b|.)+/i;
      print "Test: 1='$1', 2='$2'\n";
      # Prints: Test: 1='e', 2=''

      # vs
      $_ = 'e';
      /(?:(.)b|.)+/i;
      print "Test: 1='$1', 2='$2'\n";
      # Prints: Test: 1='', 2=''

      # BUT!
      $_ = 'efg';
      /(?:(.)b|.)+/i;
      print "Test: 1='$1', 2='$2'\n";
      # Prints: Test: 1='', 2=''

      This is all quite strange. The '1w1w' test shows that you don't need $1 to be set in order for it to be stomped, so I've no idea why the 'efg' didn't fail.

      All I wanted was to allow users to enter their options in any order!


      PS: How does one tell which capture matched, if there is garbage in the other capture variables?

        '2c3w' cannot be matched only by the second parentheses; the first parentheses must match as well, otherwise the entire match would fail.

        Incorrect.   You are using alternation so only one of the alternatives has to match for the entire match to be successful.

        Given that both the first and the second must have matched successfully,

        Using alternation only one or the other can match successfully, but not both at the same time.

        Update:

        PS: How does one tell which capture matched, if there is garbage in the other capture variables?

        From perlvar:

        One can use "$#-" to find the last matched subgroup in the last successful match.
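As a rough illustration of the perlvar note above, here is a minimal sketch. The helper name last_matched_group is made up for this example, and the pattern deliberately leaves out the + repetition from the original regex, so it only shows how $#- behaves for a single pass through the alternation. It does not claim to fix the repeated case discussed in this thread.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): returns the number of the
# highest capture group that took part in the match, or 0 on no match.
sub last_matched_group {
    my ($str) = @_;

    # Same three alternatives as the original regex, but matched once
    # (no trailing +), so no earlier iteration can leave captures behind.
    return 0 unless $str =~ /^(?:(\d+)\s*c|(\d+)\s*w|(\d+)\s*r)/i;

    # $#- is the index of the last element of @-, i.e. the last
    # subgroup that matched in the last successful match.
    return $#-;
}

for my $s (qw(1c 2w 3r)) {
    printf "%s: group %d\n", $s, last_matched_group($s);
}
```

With the repetition put back in, stale captures from earlier iterations may make $#- less informative, so treat this as a diagnostic aid rather than a cure.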

      No, this is simply a long-standing bug in the implementation of captures. $1 remaining unchanged is supposed to happen if the entire regex fails to match.

      When the regex backtracks over a completed capture, it needs to clear out that previously filled-in capture. Please 'perlbug' it.

      - tye        

Re: Leaking Regex Captures
by ELISHEVA (Prior) on Aug 04, 2009 at 18:10 UTC

    It looks to me like the regex is getting confused when it is backtracking. As BioLion notes above, jwkrahn's explanation fits the output perfectly if we remove the \s* between each letter and digit, but it doesn't fit the output when the \s* is still in place.

    while (<main::DATA>) {
        chomp;
        print "\nTesting '$_'\n";

        /^(?:(?:(\d+)c\s*)|(?:(\d+)w\s*)|(?:(\d+)r\s*))+/i;
        print "Without \\s* : 1='$1', 2='$2', 3='$3'\n";

        /^(?:(?:(\d+)\s*c\s*)|(?:(\d+)\s*w\s*)|(?:(\d+)\s*r\s*))+/i;
        print "With \\s* : 1='$1', 2='$2', 3='$3'\n";
    }

    outputs

    Testing '1c'
    Without \s* : 1='1', 2='', 3=''
    With \s* : 1='1', 2='', 3=''

    Testing '2w'
    Without \s* : 1='', 2='2', 3=''
    With \s* : 1='', 2='2', 3=''

    Testing '2c3w'
    Without \s* : 1='2', 2='3', 3=''
    With \s* : 1='3', 2='3', 3=''

    Testing '1w1w'
    Without \s* : 1='', 2='1', 3=''
    With \s* : 1='1', 2='1', 3=''

    Testing '1w2r'
    Without \s* : 1='', 2='1', 3='2'
    With \s* : 1='2', 2='2', 3='2'

    Testing '2r1c'
    Without \s* : 1='1', 2='', 3='2'
    With \s* : 1='1', 2='', 3='2'

    Best, beth

Re: Leaking Regex Captures
by moritz (Cardinal) on Aug 04, 2009 at 15:01 UTC
    I agree with your expected output, and that perl gives you a wrong result. I'm not competent enough to comment on your analysis, though.

    Update: and of course I'm wrong. See jwkrahn's reply below. Ouch.

    I'm already thinking in terms of Perl 6, where the $0, $1, $2 etc. are aliases into the match object in $/. There you can't get $2 or so leaking from the previous match, and everything is pretty much transparent.

Re: Leaking Regex Captures
by Anonymous Monk on Aug 05, 2009 at 00:08 UTC
    use re 'debug';
      That's quite handy, and thanks for posting it, but sadly it does not explain why the marked branch sets the value of $1 to 'g' even though it "failed..." to match:
      3 <ebf> <g> | 3: BRANCH(11)
      3 <ebf> <g> | 4: OPEN1(6)
      3 <ebf> <g> | 6: REG_ANY(7)
      4 <ebfg> <> | 7: CLOSE1(9)
      4 <ebfg> <> | 9: EXACTF <b>(14)
                        failed...
      3 <ebf> <g> | 11: BRANCH(13)

        That's quite handy, and thanks for posting it, but sadly it does not explain why the marked branch sets the value of $1 to 'g' even though it "failed..." to match:

        "Match successful!" means it did NOT fail.

Re: Leaking Regex Captures
by Marshall (Canon) on Aug 05, 2009 at 14:47 UTC
    I am not sure what you want.
    It would be helpful if you could give an OUTPUT section like you have a DATA section.
    Why does this have to be so complex?
    Update: small formatting change.
    #!/usr/bin/perl -w
    use strict;

    while (<DATA>) {
        print "testing: $_";
        chomp;
        my @digits = m/\d+/g;
        print "digits only: @digits\n";
        my @numletters = m/\d[^\d]+/g;
        print "digits_and_letters:@numletters\n\n";
    }

    #Prints:
    #testing: 1c
    #digits only: 1
    #digits_and_letters:1c
    #
    #testing: 2w
    #digits only: 2
    #digits_and_letters:2w
    #
    #testing: 2c3w
    #digits only: 2 3
    #digits_and_letters:2c 3w
    #
    #testing: 1w1w
    #digits only: 1 1
    #digits_and_letters:1w 1w
    #
    #testing: 1w2r
    #digits only: 1 2
    #digits_and_letters:1w 2r
    #
    #testing: 2r1c
    #digits only: 2 1
    #digits_and_letters:2r 1c
    __DATA__
    1c
    2w
    2c3w
    1w1w
    1w2r
    2r1c
      Note that this is very closely related to the context of: Re: Regex - Matching prefixes of a word

      The original goal of the regex is to match a command string similar to:
      beam 15 crew 5 wounded 2 critical to S.S.Kevorkian
      Where the number-type pairs are optional and may appear in any order, provided that there is at least one of the pairs present. (No point in beaming nobody over)

      Thus, the (\d+)\s*literals form of each piece,
      and the (?: (capture)X | (capture)Y | (capture)Z )+ overall structure.
      Wrapped around that structure is a /^(?:$regexSubstringOf{beam}|$regexSubstringOf{transport}\s* )\s*(?:$structure)\s+(?:to\s+)?$regexObjectName\s*$/i

      And then it all ends up in an addCommand('transport', {crew=>$1,wound=>$2,crit=>$3},$4) if $cmd =~ /regex/i; ($4 is the ship name, captured by the $regexObjectName)


      What I have done to work around the problem is to capture the whole pair, and then inside the addCommand() function, I fire off some more regex to s/\D//g the hash values if they are defined.
      I also have to add a negative lookahead in the captures to prevent '5 crit' from matching as a substring of 'crew': "5cr" and stomping the $1 value before backtracking kicks in.



      To sum up: I want the numbers out of those pairs, with $1 = number of healthy crew, $2 = number of wounded, $3 = number of critically injured.
      How I get them is not important. For multiple copies of a pair in the command string I don't care which one gets picked, although consistency is desirable. The last one is better than the first, since that means a user can just keep typing if they make a mistake instead of backspacing to change the number.
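One possible way to get "any order, last one wins" without relying on captures inside a repeated group is to match one pair per iteration with /g and collect the numbers in a hash. This is only a sketch under assumptions: parse_counts is a made-up name, and it uses the single-letter c/w/r suffixes from the test data at the top of the thread rather than the full crew/wounded/critical prefixes of the real command.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): collect number/letter pairs
# in any order. A later pair of the same type overwrites an earlier one.
sub parse_counts {
    my ($cmd) = @_;
    my %count;

    # Each pass of the loop is its own successful match, so $1 and $2
    # always belong to the pair that was just matched.
    while ( $cmd =~ /(\d+)\s*([cwr])/gi ) {
        $count{ lc $2 } = $1;
    }
    return \%count;
}

for my $s ( '2c3w', '1w 2r', '2r1c 7c' ) {
    my $got = parse_counts($s);
    print "'$s': ",
        join( ', ', map { "$_=$got->{$_}" } sort keys %$got ), "\n";
}
```

Unlike the single anchored regex, this loop does not check that the whole string is made up of valid pairs, so any such validation would still have to be done separately.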

        Well, how about this....?
        #!/usr/bin/perl -w
        use strict;

        while (<DATA>) {
            print "testing: $_";
            chomp;
            my @pairs = m/(\d+)\s+(\w+)/g;
            print "@pairs\n\n";
        }

        #Prints:
        #testing: beam 15 crew 5 wounded 2 critical to S.S.Kevorkian
        #15 crew 5 wounded 2 critical
        #
        #testing: oh, my gosh, darn 5 killed 2 want_sex_change 10 drunk
        #5 killed 2 want_sex_change 10 drunk
        #
        #testing: what a day:5 wounded 2 critical 20 crew
        #5 wounded 2 critical 20 crew
        #
        #testing: 20 crew and 6 killed and 14 MIA
        #20 crew 6 killed 14 MIA
        __DATA__
        beam 15 crew 5 wounded 2 critical to S.S.Kevorkian
        oh, my gosh, darn 5 killed 2 want_sex_change 10 drunk
        what a day:5 wounded 2 critical 20 crew
        20 crew and 6 killed and 14 MIA