What happens with empty $1 in regular expressions? (was: Regular Expression Question)

marcblecher has asked for the wisdom of the Perl Monks concerning the following question:

Re: Regular Expression Question by chromatic (Archbishop) on Feb 28, 2001 at 06:27 UTC
You can get rid of the second conditional. If the regex succeeds, $1 will be set to the captured element. If the regex fails, $1 will be undefined or whatever was in it previously.

```perl
my $sss = "M9";
if ($sss =~ m/(\d+)/){
    print "Value Matched on: $1\n";    # prints Value Matched on: 9
}

$sss = "Mm";
if ($sss =~ m/(\d+)/){
    print "Value Matched on: $1\n";    # if statement is not entered
}

print "\$1 holds $1!\n";
```

The generally accepted wisdom is to use conditionals exclusively if you're capturing things. You could also add an else clause, undefining $1.
Re: Regular Expression Question by ZZamboni (Curate) on Feb 28, 2001 at 06:32 UTC
The problem is that the $1, $2, etc. variables only get set if the regular expression matches. /(\d+)/ does NOT match "Mm", so no assignment occurs. What you need to do is to check whether there is a match before using the positional variables, like this:

```perl
if ($sss =~ /(\d+)/) {
    # use $1 as you wish
} else {
    # don't!
}
```

Interestingly, your last code snippet will do the right thing if you remove the part after the `&&` (and an extra opening parenthesis you seem to have there), because $1 and $2 will only be used when there is a match, and in that case they will contain either the strings they matched, or undef if their respective subexpressions didn't match anything.

--ZZamboni
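To illustrate the last point, here is a minimal sketch of a successful match in which only one of two groups takes part. The `describe` helper, the pattern, and the sample strings are made up for illustration and do not come from the original question:

```perl
use strict;
use warnings;

# Report which capture groups took part in a match of /(\d+)|([a-z]+)/.
# $1 and $2 are only read inside the branch where the match succeeded.
sub describe {
    my ($str) = @_;
    if ($str =~ /(\d+)|([a-z]+)/) {
        my $one = defined $1 ? $1 : 'undef';
        my $two = defined $2 ? $2 : 'undef';
        return "1=$one 2=$two";
    }
    return "no match";
}

print describe($_), "\n" for "42", "abc", "!!!";
# prints:
# 1=42 2=undef
# 1=undef 2=abc
# no match
```

When the match succeeds through the second alternative, $1 is undef even though the match as a whole was successful, which is why testing the match result and testing `defined $1` answer different questions.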
Re: Re: Regular Expression Question by japhy (Canon) on Feb 28, 2001 at 19:28 UTC
No, `$1` will not be `undef` if the match fails. The only time the capture variables are undefined is when a match has not yet been done, or when the last successful match does not contain a captured pattern for that variable:

```perl
#!/usr/bin/perl -w

if ("abc" =~ /(\w+)/, 1) { print "abc => $1\n" }
{
  print "local: $1\n";
  if ("ghi" =~ /\w+/, 1) { print "ghi => $1\n" }
  if ("def" =~ /(\w+)/, 1) { print "def => $1\n" }
  if ("[=]" =~ /(\w+)/, 1) { print "[=] => $1\n" }
}
print "general: $1\n";

__END__
abc => abc
local: abc
Use of uninitialized value at regexes line 6.
ghi =>
def => def
[=] => def
general: abc
```

However, a failed match returns false, so if you removed the `, 1`'s from each of those, you wouldn't see the line for "[=]".

`japhy` -- Perl and Regex Hacker
Re: Re: Re: Regular Expression Question by ZZamboni (Curate) on Feb 28, 2001 at 19:40 UTC
I agree completely with you, and that's what I meant: if the regex does not match, $1 is not modified in any way (neither set nor unset). But your example made it way clearer than any explanation :-)

--ZZamboni
Re: Regular Expression Question by dsb (Chaplain) on Feb 28, 2001 at 21:41 UTC
What I did was a bit different. With the other examples, if a match fails after an earlier match succeeded, $1 would still hold the earlier value. If your goal is to have $1 be 'undef', or to have an 'undef' value to play with, you may want to try this:

```perl
use strict;

my $i = "mmmm9";
my $a = match_rtn( $i );
print $a, "\n";

$i = "mmm";
$a = match_rtn( $i );
print $a, "\n";

sub match_rtn {
    my $str = shift;
    $str =~ m/(\d+)/;
    return $1;
}
```

So there is now a sub doing the match and returning the value of $1. What is creating the 'undef' value, though, is the use of 'strict'. From what I understand, using strict causes variables to be isolated to the block of code they are declared in, and when that block is finished the variable is destroyed. So $1 would be destroyed at the end of the routine after the value is returned. It works, though: this way you definitely have an 'undef' value to play with if that is what you are after.

I looked up exactly what 'strict' is supposed to do, and the Camel book says it's supposed to disallow "unsafe" code. My question to anyone else is: what is considered "unsafe"?

Amel - f.k.a. - kel
Re: Re: Regular Expression Question by danger (Priest) on Feb 28, 2001 at 22:25 UTC
Your method of enclosing the match operation only appears to work (in terms of leaving $1 unmodified after the sub call, and returning undef on failure), but not at all for the reasons you provide. The use of 'strict' has nothing to do with producing the undef value, and 'strict' has nothing to do with how variables are scoped. Had any successful match been applied in the outer scope prior to your sub calls, then $1 (which is global) would have been set there and its value would be the return value on any of your sub calls that failed to match. Check this minor variation on your example:

```perl
use strict;

"blah" =~ /(a)/;    # now we've set $1 at the global scope

my $i = "mmmm9";
my $a = match_rtn( $i );
print $a, "\n";

$i = "mmm";
$a = match_rtn( $i );
print $a, "\n";     # ook! this isn't undefined!

sub match_rtn {
    my $str = shift;
    $str =~ m/(\d+)/;
    return $1;
}
```

Match variables ($1, $2, etc.) are global variables. When match variables are set (due to a successful match operation), they are always localized to the enclosing block. So, they retain their value until another successful pattern match, or the end of the current block. Witness:

```perl
{
    $_ = 'blah';
    /(a)/;
    print "$1\n";   # prints: a
}
print "$1\n";       # uninitialized warning
```

Now try this longer example and you'll see that the match variables are implicitly localized (i.e., in the sense of local()):

```perl
$_ = 'blah';
/(\w)/;
print "$1\n";       # prints: b
/(\d)/;
print "$1\n";       # still prints: b
{
    /(a)/;
    print "$1\n";   # prints: a
}
print "$1\n";       # prints: b
```

The proper way to protect yourself from using unintended old values in $1 and friends is to program defensively and check if a pattern match succeeded before trying to use captured subexpressions (as previous messages in this thread have shown).

"I looked up exactly what 'strict' is supposed to do, and the Camel book says its supposed to disallow "unsafe" code. My question to anyone else is what is considered "unsafe"?"

Please see the documentation for strict and tye's review of strict.pm for starters.
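The defensive check can be folded into the helper itself. This is a minimal sketch that reuses the `match_rtn` name and the `(\d+)` pattern from the posts above; the rewrite is an illustration, not code posted in the thread:

```perl
use strict;
use warnings;

# Return the first run of digits in the string, or undef when there is none.
# $1 is only read when this particular match succeeded, so a stale value
# left by an earlier successful match cannot leak into the return value.
sub match_rtn {
    my ($str) = @_;
    return $str =~ m/(\d+)/ ? $1 : undef;
}

"blah" =~ /(a)/;    # set $1 in the outer scope, as in the variation above

for my $input ("mmmm9", "mmm") {
    my $digits = match_rtn($input);
    print "$input => ", defined $digits ? $digits : "undef", "\n";
}
# prints:
# mmmm9 => 9
# mmm => undef
```

The sub now returns undef on a failed match regardless of what any earlier match left in $1.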
Re: Re: Re: Regular Expression Question by merlyn (Sage) on Feb 28, 2001 at 22:30 UTC
"The proper way to protect yourself from using unintended old values in $1 and friends is to program defensively and check if a pattern match succeeded before trying to use captured subexpressions (as previous messages in this thread have shown)."

Well, that's a way. Another way is to stylistically outlaw all uses of `$1` et seq, except in the right side of a substitution. Any other "capturing" should be done as list-context assignment:

`my ($first, $second) = $source =~ /blah(this)blah(that)blah/;`

Then it's very clear what the scope and origination of `$first` and `$second` are.

-- Randal L. Schwartz, Perl hacker
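A short sketch of how the list-context form behaves on success and on failure. The `parse_pair` helper, the `key=value` pattern, and the sample strings are assumptions made up for illustration:

```perl
use strict;
use warnings;

# Split "key=value" into its two parts; return an empty list on no match.
sub parse_pair {
    my ($source) = @_;
    my ($key, $value) = $source =~ /^(\w+)=(\w+)$/;
    return defined $key ? ($key, $value) : ();
}

for my $source ("colour=blue", "no pair here") {
    # A list assignment in boolean context yields the number of elements
    # on its right-hand side, so an empty list makes the condition false.
    if (my ($key, $value) = parse_pair($source)) {
        print "$key is $value\n";
    }
    else {
        print "no match for: $source\n";
    }
}
# prints:
# colour is blue
# no match for: no pair here
```

Because the captures land in fresh lexicals, a failed match leaves them undefined rather than holding the result of some earlier match.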
Re: Re: Re: Re: Regular Expression Question by danger (Priest) on Feb 28, 2001 at 23:29 UTC
Re: Re: Regular Expression Question by chipmunk (Parson) on Mar 01, 2001 at 00:40 UTC
"I looked up exactly what 'strict' is supposed to do, and the Camel book says its supposed to disallow "unsafe" code. My question to anyone else is what is considered "unsafe"?"

I don't know whether you're looking at Camel II or III, but they both provide full documentation on strict. The Second edition documents the strict pragma beginning on page 500, and the Third edition beginning on page 858.

Of course, the Camel book is not the only source of documentation. Each core module and pragma also comes with built-in documentation. You can read the standard documentation for strict with `perldoc strict` (or the equivalent on Windows [the HTML-ized docs] or Mac [Shuck]). The docs from 5.005_02 are even available on this site, including the docs for strict.

To answer your question, three things are considered 'unsafe'. Each one is controlled by a separate part of strict. strict 'refs' prevents the use of symbolic references. strict 'vars' prevents the use of variables which are not pre-declared or fully qualified. strict 'subs' prevents the use of barewords.
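To make the three categories concrete, here is a small sketch that triggers each one inside a string `eval`, so the script can report the error instead of dying. The `try_snippet` helper and the three snippets are invented for illustration; the exact wording of the messages varies between Perl versions, so only the first line of each is printed and no output is shown here:

```perl
use strict;
use warnings;

# Compile and run a snippet under "use strict"; return the error text,
# or an empty string if the snippet ran cleanly.
sub try_snippet {
    my ($code) = @_;
    my $ok = eval "use strict; $code; 1";
    return $ok ? '' : $@;
}

my %snippets = (
    refs => 'my $name = "counter"; $$name = 1',   # symbolic reference
    vars => '$total = 1',                          # undeclared variable
    subs => 'my $x = some_bareword',               # bareword used as a string
);

for my $part (sort keys %snippets) {
    my ($first_line) = split /\n/, try_snippet($snippets{$part});
    print "strict $part: $first_line\n";
}
```

The 'vars' and 'subs' violations are caught when the snippet is compiled, while the 'refs' violation is only caught when the symbolic dereference actually runs.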
Re: Re: Re: Regular Expression Question by dsb (Chaplain) on Mar 01, 2001 at 00:47 UTC
That much I got (didn't mean to sound like a dkhead). What I don't get is why that is unsafe. Is it because a bareword could be confused with a built-in function or something like that? Why are undeclared variables unsafe? Same kind of thing?

Excuse my ignorance. I am starting to really get interested in the theories of programming and things like that, and I would really like to get a handle on this stuff. I taught myself Perl about a year ago and I've had no one to ask these questions to. They are all sort of flooding out now. Thanks for your help.

Amel - f.k.a. - kel
strict; why these practices are unsafe by chipmunk (Parson) on Mar 01, 2001 at 01:05 UTC
(tye)Re: Regular Expression Question by tye (Sage) on Mar 01, 2001 at 00:57 UTC