marcblecher has asked for the wisdom of the Perl Monks concerning the following question:

Edited by mirod: changed the title to "What happens with empty $1 in regular expressions?"

I have a question about what happens to $1 if it matches neither the empty string nor anything else as in:

    $sss = "M9";
    $sss =~ m/(\d+)/;
    print $1;   # --> prints 9
    $sss = "Mm";
    $sss =~ m/(\d+)/;
    print $1;   # --> still prints 9

$1 in this case will still equal 9 rather than undef as would be expected. The following code protects against $1 not resetting from previous matches, but it requires two tests. Does anybody know of a good way to ensure $1 resets between matches, so that we could avoid having to use two tests to determine the status of a match?

    $sss = "M9";
    if ($sss =~ m/(\d+)/ && $1 ne undef) {
        print "Value Matched on: $1";   # prints Value Matched on: 9
    }
    $sss = "Mm";
    if ($sss =~ m/(\d+)/ && $1 ne undef) {
        print "Value Matched on: $1";   # if statement is not entered; prints nothing
    }

Also, this solution only works if I'm trying to match on one thing at a time, but not if I try to match on multiple items, as in the next code snippet:

    $sss = "9 9";
    if ($sss =~ m/(([A-Za-z]*)(\d*)/ && $1 ne undef) {
        print "Value Matched on: $1";   # if statement is not entered; prints nothing
    }

If the if statement evaluates to true (as it would in this case), there is no way for me to tell whether that happened because of the first or because of the second match.

Replies are listed 'Best First'.
Re: Regular Expression Question
by chromatic (Archbishop) on Feb 28, 2001 at 06:27 UTC
    You can get rid of the second conditional. If the regex succeeds, $1 will be set to the captured element. If the regex fails, $1 will be undefined or whatever was in it previously.
        my $sss = "M9";
        if ($sss =~ m/(\d+)/) {
            print "Value Matched on: $1\n";   # prints Value Matched on: 9
        }
        $sss = "Mm";
        if ($sss =~ m/(\d+)/) {
            print "Value Matched on: $1\n";   # if statement is not entered
        }
        print "\$1 holds $1!\n";

    The generally accepted wisdom is to use conditionals exclusively if you're capturing things. You could also add an else clause, undefining $1.
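
As a minimal sketch of that advice (the helper name first_digits is hypothetical, not something from this thread), the capture can be copied into a lexical only inside the branch where the match succeeded, so a stale $1 is never read:

```perl
use strict;
use warnings;

# Hypothetical helper: returns the first run of digits, or undef if none.
sub first_digits {
    my ($text) = @_;
    if ($text =~ m/(\d+)/) {
        return $1;   # safe: $1 was set by this successful match
    }
    return;          # no match: do not read $1, it may hold an old value
}

for my $sss ("M9", "Mm") {
    my $digits = first_digits($sss);
    print "$sss: ", (defined $digits ? "matched on $digits" : "no match"), "\n";
}
```

The caller tests the returned value with defined rather than looking at $1 at all.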

Re: Regular Expression Question
by ZZamboni (Curate) on Feb 28, 2001 at 06:32 UTC
    The problem is that the $1, $2, etc. variables only get set if the regular expression matches. /(\d+)/ does NOT match "Mm", so no assignment occurs. What you need to do is to check whether there is a match before using the positional variables, like this:
        if ($sss =~ /(\d+)/) {
            # use $1 as you wish
        } else {
            # don't!
        }
    Interestingly, your last code snippet will do the right thing if you remove the part after the && (and an extra opening parenthesis you seem to have there), because $1 and $2 will only be used when there is a match, and in that case they will contain either the strings they matched, or undef if their respective subexpressions didn't match anything.
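
A rough sketch of telling the captures apart after a successful match. Note the assumption: the pattern below uses an alternation with + quantifiers, which is not the original poster's pattern. With * quantifiers, as in the original, a group that takes part in the match but consumes nothing holds an empty string rather than undef, so defined alone would not distinguish the two cases. The helper name classify is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: reports which alternative matched at the start.
sub classify {
    my ($text) = @_;
    if ($text =~ m/^(?:([A-Za-z]+)|(\d+))/) {
        # Only the group in the alternative that matched is defined.
        return "letters: $1" if defined $1;
        return "digits: $2"  if defined $2;
    }
    return "no match";
}

print classify($_), "\n" for "abc 9", "9 9", "!!";
```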

    --ZZamboni

      No, $1 will not be undef if the match fails. The only time they are undefined is when a match has not yet been done, or a match does not contain a captured pattern for that variable:
          #!/usr/bin/perl -w

          if ("abc" =~ /(\w+)/, 1) { print "abc => $1\n" }
          {
            print "local: $1\n";
            if ("ghi" =~ /\w+/, 1) { print "ghi => $1\n" }
            if ("def" =~ /(\w+)/, 1) { print "def => $1\n" }
            if ("[=]" =~ /(\w+)/, 1) { print "[=] => $1\n" }
          }
          print "general: $1\n";
          __END__
          abc => abc
          local: abc
          Use of uninitialized value at regexes line 6.
          ghi => 
          def => def
          [=] => def
          general: abc

      However, a failed match returns false, so if you removed the , 1's from each of those, you wouldn't see the line for "[=]".

      japhy -- Perl and Regex Hacker
        I agree completely with you, and that's what I meant: if the regex does not match, $1 is not modified in any way (neither set nor unset). But your example made it way clearer than any explanation :-)

        --ZZamboni

Re: Regular Expression Question
by dsb (Chaplain) on Feb 28, 2001 at 21:41 UTC
    What I did was a bit different. If you used the other examples and a match failed after an earlier match had succeeded, then $1 would still hold the earlier value. If your goal is to have $1 be 'undef', or to have an 'undef' value to play with, you may want to try this:

        use strict;

        my $i = "mmmm9";
        my $a = match_rtn( $i );
        print $a, "\n";

        $i = "mmm";
        $a = match_rtn( $i );
        print $a, "\n";

        sub match_rtn {
            my $str = shift;
            $str =~ m/(\d+)/;
            return $1;
        }

    So there is a method now doing the checking and returning the value of $1. What is creating the 'undef' value though, is the use of 'strict'. From what I understand, using strict causes variables to be isolated to the block of code they are declared in, and when that block is finished the variable is destroyed. So $1 would be destroyed at the end of the routine after the value is returned. It works though, this way you definitely have an 'undef' value to play with if that is what you are after.

    I looked up exactly what 'strict' is supposed to do, and the Camel book says it's supposed to disallow "unsafe" code. My question to anyone else is what is considered "unsafe"?

    Amel - f.k.a. - kel

      Your method of enclosing the match operation only *appears* to work (in terms of leaving $1 unmodified after the sub call, and returning undef on failure), but not at all for the reasons you provide. The use of 'strict' has nothing to do with producing the undef value, and 'strict' has nothing to do with how variables are scoped. Had any successful match been applied in the outer scope prior to your sub calls, then $1 (which is global) would have been set there and its value would be the return value on any of your sub calls that failed to match. Check this minor variation on your example:

          use strict;

          "blah" =~ /(a)/;   # now we've set $1 at the global scope

          my $i = "mmmm9";
          my $a = match_rtn( $i );
          print $a, "\n";

          $i = "mmm";
          $a = match_rtn( $i );
          print $a, "\n";    # ook! this isn't undefined!

          sub match_rtn {
              my $str = shift;
              $str =~ m/(\d+)/;
              return $1;
          }

      Match variables ($1, $2, etc.) are global variables. When match variables are set (due to a successful match operation), they are always localized to the enclosing block. So, they retain their value until another successful pattern match, or the end of the current block. Witness:

          {
              $_ = 'blah';
              /(a)/;
              print "$1\n";   # prints: a
          }
          print "$1\n";       # uninitialized warning

      Now try this longer example and you'll see that the match variables are implicitly localized (i.e., in the sense of local()):

          $_ = 'blah';
          /(\w)/;
          print "$1\n";       # prints: b
          /(\d)/;
          print "$1\n";       # still prints: b
          {
              /(a)/;
              print "$1\n";   # prints: a
          }
          print "$1\n";       # prints: b

      The proper way to protect yourself from using unintended old values in $1 and friends is to program defensively and check if a pattern match succeeded before trying to use captured subexpressions (as previous messages in this thread have shown).

      I looked up exactly what 'strict' is supposed to do, and the Camel book says it's supposed to disallow "unsafe" code. My question to anyone else is what is considered "unsafe"?

      Please see the documentation for strict and tye's review of strict.pm for starters.

        The proper way to protect yourself from using unintended old values in $1 and friends is to program defensively and check if a pattern match succeeded before trying to use captured subexpressions (as previous messages in this thread have shown).
        Well, that's a way. Another way is to stylistically outlaw all uses of $1 et seq, except in the right side of a substitution. Any other "capturing" should be done as list-context assignment:
        my ($first, $second) = $source =~ /blah(this)blah(that)blah/;
        Then it's very clear what the scope and origination of $first and $second are.
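
A small sketch of that list-context style, with the success test folded into the same statement. The helper name parse_pair and the key=value input format are illustrative assumptions, not part of the thread. A failed match returns an empty list, so the list assignment evaluates to 0 in the if condition and the variables are never used with stale data:

```perl
use strict;
use warnings;

# Hypothetical helper: splits "name=value" into its two parts.
sub parse_pair {
    my ($source) = @_;
    # The list assignment yields the number of captured items in the
    # condition; a failed match gives an empty list, which is false.
    if (my ($name, $val) = $source =~ /^(\w+)=(\w+)$/) {
        return ($name, $val);
    }
    return;   # empty list on failure
}

for my $source ("key=value", "no pair here") {
    if (my ($name, $val) = parse_pair($source)) {
        print "name=$name val=$val\n";
    }
    else {
        print "'$source' did not match\n";
    }
}
```

Because $name and $val are lexicals declared at the point of the match, their scope and origin are visible on the same line, and $1 never appears.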

        -- Randal L. Schwartz, Perl hacker

      I looked up exactly what 'strict' is supposed to do, and the Camel book says it's supposed to disallow "unsafe" code. My question to anyone else is what is considered "unsafe"?

      I don't know whether you're looking at Camel II or III, but they both provide full documentation on strict. The Second edition documents the strict pragma beginning on page 500, and the Third edition beginning on page 858.

      Of course, the Camel book is not the only source of documentation. Each core module and pragma also comes with built-in documentation. You can read the standard documentation for strict with perldoc strict (or the equivalent on Windows [the HTML-ized docs] or Mac [Shuck]).

      The docs from 5.005_02 are even available on this site, including the docs for strict.

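      To answer your question, three things are considered 'unsafe'. Each one is controlled by a separate part of strict. strict 'refs' prevents the use of symbolic references. strict 'vars' prevents the use of variables which are not pre-declared or fully qualified. strict 'subs' prevents the use of barewords as strings: a bareword that is not a subroutine name is an error unless it appears where Perl quotes it automatically, such as a simple hash key or the left side of =>.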
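
To see the three categories in action, here is a minimal sketch. The snippet strings and the helper name strict_error are made up for illustration. Each violation is run through a string eval so the error message can be captured instead of stopping the program:

```perl
use strict;
use warnings;

# One illustrative violation per strict category (hypothetical snippets).
my %violations = (
    refs => 'use strict; my $name = "total"; $$name = 1; 1',   # symbolic reference
    vars => 'use strict; $undeclared = 1; 1',                  # undeclared variable
    subs => 'use strict; my $x = some_bareword; 1',            # bareword used as a string
);

# Hypothetical helper: returns the error text for a category, or "" if none.
sub strict_error {
    my ($kind) = @_;
    my $ok = eval $violations{$kind};
    return $ok ? "" : $@;
}

for my $kind (sort keys %violations) {
    print "strict '$kind': ", strict_error($kind);
}
```

The 'vars' case is the one most people meet first: without it, a misspelled variable name silently creates a new, empty global instead of raising an error.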

        That much I got (didn't mean to sound like a d**khead). What I don't get is why that is unsafe. Is it because a bareword could be confused with a built-in function or something like that? Why are undeclared variables unsafe? Same kind of thing?

        Excuse my ignorance. I am starting to really get interested in the theories of programming and things like that and I would really like to get a handle on this stuff. I taught myself Perl about a year ago and I've had no one to ask these questions to. They are all sort of flooding out now.

        Thanks for your help.

        Amel - f.k.a. - kel