Re: Leaking Regex Captures
by jwkrahn (Abbot) on Aug 04, 2009 at 15:00 UTC
You have to understand that the numbered variables retain their values from the last successful match. '1c' is matched by the first capturing parentheses; '2w', '2c3w', and '1w1w' are captured by the second; and '1w2r' and '2r1c' are captured by the third. The values returned by the other capturing parentheses are therefore not valid and/or undefined. To get valid results, use only the contents of the capturing parentheses that actually matched:
use strict;
use warnings;
print "Enter your test strings:\n";
while ( <DATA> )
{
chomp;
print "\tTesting '$_':\n";
/^(?:(?:(\d+)\s*c\s*)|(?:(\d+)\s*w\s*)|(?:(\d+)\s*r\s*))+/i and print "Capturing \\d+ only: '$+'\n";
/^(?:(?:(\d+\s*c)\s*)|(?:(\d+\s*w)\s*)|(?:(\d+\s*r)\s*))+/i and print "Capturing \\d+ plus the letter: '$+'\n";
}
__DATA__
1c
2w
2c3w
1w1w
1w2r
2r1c
I have been trying to understand how SuicideJunkie's code causes the results it does, and I am getting lost.
jwkrahn - do you mean that $1 etc... are not being reset if they do not match? So as SuicideJunkie asked - how come they seem to inherit the value of the 'next' match? i.e. $2 = $3 (but only if $3 matches first...)?
I also found that the \s* part of the regex is causing some of the problem (see regex 2 below; in isolation it works as expected), but then I moved the order of the regexes around and came back to the same problem, with the 'fixed' regex 2 now 'inheriting' the faulty results from regex 1.
This behaviour really confuses me! And sorry to SuicideJunkie again for jumping on his node!!
Just a something something...
'2c3w' cannot be matched only by the second parentheses; the first parentheses must match as well, otherwise the entire match would fail. Given that both the first and the second must have matched successfully, if both $1 and $2 should "retain their values from the last successful match", then $1 should be 2, not 3.
Since /^(?:(?:(\d+(?![rw]))\s*c\s*)|(?:(\d+(?![rc]))\s*w\s*)|(?:(\d+(?![cw]))\s*r\s*))+/i; also works as expected, it seems to me that the definition of "last successful match" might be changing between runs of a repetition. On the first pass, a successful match requires the whole alternation to match before it sets the capture variable, but on subsequent repeats, only the parentheses need to match before $1 changes?
$_ = 'bb ca de';
/(?:(.)b|.)+/i;
print "Test: 1='$1', 2='$2'\n";
# Prints: Test: 1='e', 2=''
# vs
$_ = 'e';
/(?:(.)b|.)+/i;
print "Test: 1='$1', 2='$2'\n";
# Prints: Test: 1='', 2=''
# BUT!
$_ = 'efg';
/(?:(.)b|.)+/i;
print "Test: 1='$1', 2='$2'\n";
# Prints: Test: 1='', 2=''
This is all quite strange. The '1w1w' test shows that you don't need $1 to be set in order for it to be stomped, so I've no idea why the 'efg' didn't fail.
All I wanted was to allow users to enter their options in any order!
PS: How does one tell which capture matched, if there is garbage in the other capture variables?
"'2c3w' cannot be matched only by the second parentheses; the first parentheses must match as well, otherwise the entire match would fail."
Incorrect. You are using alternation, so only one of the alternatives has to match for the entire match to be successful.
"Given that both the first and the second must have matched successfully,"
Using alternation, only one or the other can match successfully, but not both at the same time.
Update:
"PS: How does one tell which capture matched, if there is garbage in the other capture variables?"
From perlvar:
One can use "$#-" to find the last matched subgroup in the last successful match.
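As a rough illustration of that perlvar hint, here is a minimal sketch. It assumes a single number-letter pair and no repetition, so the captures are not disturbed by the repeated group discussed in this thread. The sub name classify is made up for this example.

```perl
use strict;
use warnings;

# Hypothetical helper: report which capture group matched a single
# number-letter pair, and the digits that group captured.
sub classify {
    my ($str) = @_;
    return unless $str =~ /^(?:(\d+)c|(\d+)w|(\d+)r)$/i;

    my $group = $#-;    # last matched subgroup of the last successful match

    # Read the group's text through the @- and @+ offset arrays, so the
    # code never has to look at the other numbered variables.
    my $text = substr $str, $-[$group], $+[$group] - $-[$group];
    return ( $group, $text );
}

for my $input (qw(1c 2w 7r)) {
    my ( $group, $digits ) = classify($input);
    print "'$input': group $group captured '$digits'\n";
}
```

Whether $#- stays trustworthy inside the repeated (?: ... )+ form is exactly what is in question in this thread, so treat this as a sketch for the single-pair case only.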
No, this is simply a long-standing bug in the implementation of captures. $1 remaining unchanged is supposed to happen if the entire regex fails to match.
When the regex backtracks over a completed capture, it needs to clear out that previously filled-in capture. Please 'perlbug' it.
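Until that is fixed, one possible way around it is to keep the captures out of the repeated group altogether and walk the string one pair at a time with \G. This is only a sketch: parse_pairs is a hypothetical name, and it assumes the short c/w/r forms from the test data above.

```perl
use strict;
use warnings;

# Hypothetical helper: parse strings such as '2c3w' into { c => 2, w => 3 }.
# Returns nothing if the string is not made up entirely of number-letter pairs.
sub parse_pairs {
    my ($str) = @_;
    my %value;
    pos($str) = 0;

    # Each pass through the loop is a separate match anchored at \G, so
    # $1 and $2 always belong to the pair that was just consumed.
    # /c keeps pos() in place when the final attempt fails.
    while ( $str =~ /\G\s*(\d+)\s*([cwr])\s*/gci ) {
        $value{ lc $2 } = $1;    # a later pair overwrites an earlier one
    }

    return if !%value;                       # no pair at all
    return if pos($str) != length $str;      # trailing junk
    return \%value;
}

for my $input (qw(2c3w 1w1w 1w2r)) {
    my $v = parse_pairs($input) or next;
    print "$input: ", join( ', ', map {"$_=$v->{$_}"} sort keys %$v ), "\n";
}
```

There is one capture for the number and one for the letter, and no capture sits inside an alternation that can be backtracked over within a repetition.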
Re: Leaking Regex Captures
by ELISHEVA (Prior) on Aug 04, 2009 at 18:10 UTC
It looks to me like the regex is getting confused when it is backtracking. As BioLion notes above, jwkrahn's explanation fits the output perfectly if we remove the \s* between each letter and digit, but it doesn't fit the output when the \s* is still in place.
while (<main::DATA>)
{
chomp;
print "\nTesting '$_'\n";
/^(?:(?:(\d+)c\s*)|(?:(\d+)w\s*)|(?:(\d+)r\s*))+/i;
print "Without \\s* : 1='$1', 2='$2', 3='$3'\n";
/^(?:(?:(\d+)\s*c\s*)|(?:(\d+)\s*w\s*)|(?:(\d+)\s*r\s*))+/i;
print "With \\s* : 1='$1', 2='$2', 3='$3'\n";
}
outputs
Testing '1c'
Without \s* : 1='1', 2='', 3=''
With \s* : 1='1', 2='', 3=''
Testing '2w'
Without \s* : 1='', 2='2', 3=''
With \s* : 1='', 2='2', 3=''
Testing '2c3w'
Without \s* : 1='2', 2='3', 3=''
With \s* : 1='3', 2='3', 3=''
Testing '1w1w'
Without \s* : 1='', 2='1', 3=''
With \s* : 1='1', 2='1', 3=''
Testing '1w2r'
Without \s* : 1='', 2='1', 3='2'
With \s* : 1='2', 2='2', 3='2'
Testing '2r1c'
Without \s* : 1='1', 2='', 3='2'
With \s* : 1='1', 2='', 3='2'
Best, beth
Re: Leaking Regex Captures
by moritz (Cardinal) on Aug 04, 2009 at 15:01 UTC
I agree with your expected output, and that perl gives you a wrong result. I'm not competent enough to comment on your analysis, though.
Update: and of course I'm wrong. See jwkrahn's reply below. Ouch.
I'm already thinking in terms of Perl 6, where the $0, $1, $2 etc. are aliases into the match object in $/. There you can't get $2 or so leaking from the previous match, and everything is pretty much transparent.
Re: Leaking Regex Captures
by Anonymous Monk on Aug 05, 2009 at 00:08 UTC
That's quite handy, and thanks for posting it, but sadly it does not explain why the marked branch sets the value of $1 to 'g' even though it "failed..." to match:
3 <ebf> <g> | 3: BRANCH(11)
3 <ebf> <g> | 4: OPEN1(6)
3 <ebf> <g> | 6: REG_ANY(7)
4 <ebfg> <> | 7: CLOSE1(9)
4 <ebfg> <> | 9: EXACTF <b>(14)
failed...
3 <ebf> <g> | 11: BRANCH(13)
Re: Leaking Regex Captures
by Marshall (Canon) on Aug 05, 2009 at 14:47 UTC
I am not sure what you want.
It would be helpful if you could give an OUTPUT section like you have a DATA section.
Why does this have to be so complex?
Update: small formatting change.
#!/usr/bin/perl -w
use strict;
while (<DATA>)
{
print "testing: $_";
chomp;
my @digits = m/\d+/g;
print "digits only: @digits\n";
my @numletters = m/\d+\D+/g;    # \d+ keeps multi-digit numbers such as '15' whole
print "digits_and_letters:@numletters\n\n";
}
#Prints:
#testing: 1c
#digits only: 1
#digits_and_letters:1c
#
#testing: 2w
#digits only: 2
#digits_and_letters:2w
#
#testing: 2c3w
#digits only: 2 3
#digits_and_letters:2c 3w
#
#testing: 1w1w
#digits only: 1 1
#digits_and_letters:1w 1w
#
#testing: 1w2r
#digits only: 1 2
#digits_and_letters:1w 2r
#
#testing: 2r1c
#digits only: 2 1
#digits_and_letters:2r 1c
__DATA__
1c
2w
2c3w
1w1w
1w2r
2r1c
Note that this is very closely related to the context of: Re: Regex - Matching prefixes of a word
The original goal of the regex is to match a command string similar to:
beam 15 crew 5 wounded 2 critical to S.S.Kevorkian
Where the number-type pairs are optional and may appear in any order, provided that there is at least one of the pairs present. (No point in beaming nobody over)
Thus, the (\d+)\s*literals form of each piece, and the (?: (capture)X | (capture)Y | (capture)Z )+ overall structure.
Wrapped around that structure is a /^(?:$regexSubstringOf{beam}|$regexSubstringOf{transport}\s* )\s*(?:$structure)\s+(?:to\s+)?$regexObjectName\s*$/i
And then it all ends up in an addCommand('transport', {crew=>$1,wound=>$2,crit=>$3},$4) if $cmd =~ /regex/i; ($4 is the ship name, captured by the $regexObjectName)
What I have done to work around the problem is to capture the whole pair, and then inside the addCommand() function, I fire off some more regex to s/\D//g the hash values if they are defined. I also have to add a negative lookahead in the captures to prevent '5 crit' from matching as a substring of 'crew': "5cr" and stomping the $1 value before backtracking kicks in.
To sum up: I want the numbers out of those pairs, with $1 = number of healthy crew, $2 = number of wounded, $3 = number of critically injured. How I get them is not important. For multiple copies of a pair in the command string I don't care which one gets picked, although consistency is desirable, and the last one is better than the first: that way a user who makes a mistake can just keep typing instead of backspacing to change the number.
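One way to get those three numbers without depending on $1..$3 from a repeated alternation is sketched below. It is an illustration under stated assumptions, not the actual game code: parse_transport is a made-up name, the keywords are assumed to be spelled out in full (the prefix matching done by $regexSubstringOf is left out), and the surrounding beam/ship-name parts of the command are ignored.

```perl
use strict;
use warnings;

# Hypothetical helper: collect the number for each category from a
# command string. Full keywords only; prefix forms such as '5 crit'
# are not handled in this sketch.
sub parse_transport {
    my ($cmd) = @_;
    my %count;

    # In list-free while/g form, every iteration is its own match, so
    # $1 and $2 always describe the pair that was just found.
    while ( $cmd =~ /(\d+)\s*(crew|wounded|critical)\b/gi ) {
        $count{ lc $2 } = $1;    # the last occurrence of a category wins
    }
    return \%count;
}

my $c = parse_transport('beam 15 crew 5 wounded 2 critical to S.S.Kevorkian');
print join( ', ', map {"$_=$c->{$_}"} sort keys %$c ), "\n";
# prints: crew=15, critical=2, wounded=5
```

Because later pairs overwrite earlier ones, a user who mistypes a number can simply type the pair again, which matches the "last one wins" preference described above.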
Well, how about this....?
#!/usr/bin/perl -w
use strict;
while (<DATA>)
{
print "testing: $_";
chomp;
my @pairs = m/(\d+)\s+(\w+)/g;
print "@pairs\n\n";
}
#Prints:
#testing: beam 15 crew 5 wounded 2 critical to S.S.Kevorkian
#15 crew 5 wounded 2 critical
#
#testing: oh, my gosh, darn 5 killed 2 want_sex_change 10 drunk
#5 killed 2 want_sex_change 10 drunk
#
#testing: what a day:5 wounded 2 critical 20 crew
#5 wounded 2 critical 20 crew
#
#testing: 20 crew and 6 killed and 14 MIA
#20 crew 6 killed 14 MIA
__DATA__
beam 15 crew 5 wounded 2 critical to S.S.Kevorkian
oh, my gosh, darn 5 killed 2 want_sex_change 10 drunk
what a day:5 wounded 2 critical 20 crew
20 crew and 6 killed and 14 MIA