in reply to Leaking Regex Captures

You have to understand that the numbered capture variables ($1, $2, ...) retain their values from the last successful match. '1c' is matched by the first capturing parentheses; '2w', '2c3w', and '1w1w' are captured by the second capturing parentheses; and '1w2r' and '2r1c' are captured by the third capturing parentheses. The values returned by the other capturing parentheses are therefore not valid and/or undefined. To get valid results, use only the contents of the capturing parentheses that actually matched:

use strict;
use warnings;

print "Enter your test strings:\n";
while ( <DATA> ) {
    chomp;
    print "\tTesting '$_':\n";
    /^(?:(?:(\d+)\s*c\s*)|(?:(\d+)\s*w\s*)|(?:(\d+)\s*r\s*))+/i
        and print "Capturing \\d+ only: '$+'\n";
    /^(?:(?:(\d+\s*c)\s*)|(?:(\d+\s*w)\s*)|(?:(\d+\s*r)\s*))+/i
        and print "Capturing \\d+ plus the letter: '$+'\n";
}
__DATA__
1c
2w
2c3w
1w1w
1w2r
2r1c

Replies are listed 'Best First'.
Re^2: Leaking Regex Captures
by BioLion (Curate) on Aug 04, 2009 at 15:45 UTC

    I have been trying to understand how SuicideJunkie's code causes the results it does, and I am getting lost.
    jwkrahn - do you mean that $1 etc... are not being reset if they do not match? So as SuicideJunkie asked - how come they seem to inherit the value of the 'next' match? i.e. $2 = $3 (but only if $3 matches first...)?

    I also found that the \s* part of the regex is causing some of the problem (see regex 2 below: in isolation it works as expected),
    but then I moved around the order of the regexes and came back to the same problem, with the 'fixed' regex 2 now 'inheriting' the faulty results from regex 1.

    This behaviour really confuses me! And sorry to SuicideJunkie again for jumping on his node!!

    Just a something something...
Re^2: Leaking Regex Captures
by SuicideJunkie (Vicar) on Aug 04, 2009 at 16:23 UTC

    '2c3w' cannot be matched only by the second parentheses; the first parentheses must match as well, otherwise the entire match would fail. Given that both the first and the second must have matched successfully, if both $1 and $2 should "retain their values from the last successful match", then $1 should be 2, not 3.

    Since /^(?:(?:(\d+(?![rw]))\s*c\s*)|(?:(\d+(?![rc]))\s*w\s*)|(?:(\d+(?![cw]))\s*r\s*))+/i; also works as expected, it seems to me that the definition of "last successful match" might be changing between runs of a repetition.

    On the first pass, a successful match requires the whole alternation to match before it sets the capture variable, but on subsequent repeats, only the parentheses need to match before $1 changes?

    $_ = 'bb ca de';
    /(?:(.)b|.)+/i;
    print "Test: 1='$1', 2='$2'\n";
    # Prints: Test: 1='e', 2=''

    # vs
    $_ = 'e';
    /(?:(.)b|.)+/i;
    print "Test: 1='$1', 2='$2'\n";
    # Prints: Test: 1='', 2=''

    # BUT!
    $_ = 'efg';
    /(?:(.)b|.)+/i;
    print "Test: 1='$1', 2='$2'\n";
    # Prints: Test: 1='', 2=''

    This is all quite strange. The '1w1w' test shows that you don't need $1 to be set in order for it to be stomped, so I've no idea why the 'efg' didn't fail.

    All I wanted was to allow users to enter their options in any order!
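
One way to accept the options in any order without relying on captures inside a repeated alternation is to match one option per iteration with \G and /gc. This is only a sketch, not code from the thread: the helper name parse_options and the decision to return (count, letter) pairs are assumptions made for illustration. Each iteration is a separate successful match in which both groups always participate, so there are no unused groups to inspect.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): split input such as '2c3w'
# into [count, letter] pairs, one regex match per option.
sub parse_options {
    my ($input) = @_;
    my @pairs;
    pos($input) = 0;
    # \G continues where the previous match ended; /c keeps pos() on failure.
    while ( $input =~ /\G\s*(\d+)\s*([cwr])/gci ) {
        push @pairs, [ $1, lc $2 ];
    }
    # Anything left other than trailing whitespace means the input is invalid.
    $input =~ /\G\s*\z/gc or return;
    return \@pairs;
}

for my $test (qw( 1c 2w 2c3w 1w1w 1w2r 2r1c )) {
    my $pairs = parse_options($test) or next;
    print "$test => ", join( ', ', map { "$_->[0] x $_->[1]" } @$pairs ), "\n";
}
```

Because every pair is collected as it is matched, repeated letters such as '1w1w' are all kept, and the caller can decide whether the last one wins or the counts add up.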


    PS: How does one tell which capture matched, if there is garbage in the other capture variables?

      '2c3w' cannot be matched only by the second parentheses; the first parentheses must match as well, otherwise the entire match would fail.

      Incorrect.   You are using alternation so only one of the alternatives has to match for the entire match to be successful.

      Given that both the first and the second must have matched successfully,

      Using alternation only one or the other can match successfully, but not both at the same time.

      Update:

      PS: How does one tell which capture matched, if there is garbage in the other capture variables?

      From perlvar:

      One can use "$#-" to find the last matched subgroup in the last successful match.
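
As a rough illustration of that perlvar hint, the sketch below applies $#- to a single, non-repeated alternation. The helper name which_group and the simplified pattern are assumptions for the example, not code from the thread. Whether $#- stays reliable for the repeated form discussed here is not something this thread establishes, so treat it as a starting point only.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): report which alternative of a
# single, non-repeated alternation matched, using $#- as described in perlvar.
sub which_group {
    my ($str) = @_;
    return unless $str =~ /^(?:(\d+)\s*c|(\d+)\s*w|(\d+)\s*r)\s*$/i;
    my $n = $#-;    # index of the last group that took part in the match
    # @- and @+ hold the start and end offsets of each group.
    return ( $n, substr( $str, $-[$n], $+[$n] - $-[$n] ) );
}

for my $test (qw( 1c 2w 3r )) {
    my ( $group, $text ) = which_group($test);
    print "$test: group $group captured '$text'\n";
}
```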

        '2c3w' cannot be matched only by the second parentheses; the first parentheses must match as well, otherwise the entire match would fail.
        Incorrect. You are using alternation so only one of the alternatives has to match for the entire match to be successful.

        The alternation is repeated with the + so that multiple branches can match. And the regex is anchored with an '^' so in order for the '3w' to match, the '2c' must match first. Not at the same time, but they both do match on the same string.

        Adding a '$' anchor does not change the symptoms, and was left out of the example.

        Perhaps a stepwise commented example would make it clear what my issue is.

        $1 should DEFINITELY not be 'c'!
        Where did the 'a' go?


        Compare with:

        And this time, $1 was handled sensibly.
Re^2: Leaking Regex Captures (bug!)
by tye (Sage) on Aug 05, 2009 at 14:33 UTC

    No, this is simply a long-standing bug in the implementation of captures. $1 remaining unchanged is supposed to happen if the entire regex fails to match.

    When the regex backtracks over a completed capture, it needs to clear out that previously filled-in capture. Please 'perlbug' it.
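
A small reproducer along these lines might be handy to attach to such a report. It is a sketch that reuses the regex and test strings posted earlier in this thread; the helper name capture_report is made up for the example. No expected output is given because the result may differ between perl versions, which is the point of printing $] first.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): run the regex from the earlier
# example and describe what $1 holds afterwards.
sub capture_report {
    my ($str) = @_;
    return "'$str': no match" unless $str =~ /(?:(.)b|.)+/;
    return sprintf "'%s': \$1=%s", $str, defined $1 ? "'$1'" : 'undef';
}

print "perl $]\n";    # record the interpreter version with the results
print capture_report($_), "\n" for 'bb ca de', 'e', 'efg';
```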

    - tye