There is a very common mistake among newcomers to Perl and Perl's regular expressions: trusting the value of special matching variables such as $1 without verifying that a match succeeded.

It isn't much of a logical leap to realize that a match must have occurred for $1 to contain useful data. Still, many beginners just don't think of it, and that oversight leads to warnings and sometimes to difficult-to-locate bugs in their code.
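A minimal sketch of the oversight, assuming a made-up input line with no digits and assuming nothing has matched earlier in the program. The variable names are hypothetical. It collects warnings in an array so the effect of the unchecked read is visible, then shows a checked read next to it.

```perl
use strict;
use warnings;

# Hypothetical input that does not contain any digits.
my $line = 'no number here';

# Collect warnings instead of printing them, so the effect is visible.
my @warnings;
local $SIG{__WARN__} = sub { push @warnings, $_[0] };

# Unchecked: the match fails and nothing has matched earlier in this
# program, so $1 is undef and interpolating it triggers a warning.
$line =~ /(\d+)/;
my $unchecked = "value: $1";

# Checked: only touch $1 when the match succeeded.
my $checked = $line =~ /(\d+)/ ? "value: $1" : 'value: (none)';

print scalar(@warnings), " warning(s)\n";
print "$checked\n";
```

If an earlier match in the same scope had succeeded, the unchecked read would not warn at all; it would quietly pick up the old capture, which is the harder bug to find.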

The Perl PODs provide several documents that deal with regular expressions. The primary ones, in the order a beginner should probably read them, are perlrequick, perlretut, and perlre.

What surprised me when I recently re-read these docs is that none of those three primary regex documents warns against using $1 without checking for a match. The closest thing I could find was in perlre:

The numbered variables ($1, $2, $3, etc.) and the related punctuation set ($+, $&, $`, $', and $^N) are all dynamically scoped until the end of the enclosing block or until the next successful match, whichever comes first. (See perlsyn/"Compound Statements".)

perlvar is only slightly more helpful:

$<digit> Contains the subpattern from the corresponding set of capturing parentheses from the last pattern match, not counting patterns matched in nested blocks that have been exited already. (Mnemonic: like \digits.) These variables are all read-only and dynamically scoped to the current BLOCK.

Again, someone with a little experience is going to read between the lines and know that if there isn't a match, there's nothing to see in $1. The hints are all there. But it would be more explicit if one of the aforementioned docs simply stated, "The value of the capturing variables is undef in the event that there is no match, or the value of the most recent capture resulting from a match."

And then there's perlfaq6: Silent on the issue.

This being an issue that manifests itself with great regularity, I'm surprised that there isn't any mention that if a match fails the $<digit> variables will be undefined. And taking it one step further, it would be nice if somewhere in the docs mention were made that one should test for a match before relying on the $<digit> variables.

Perl gives people enough rope to tie impressive knots, or to hang themselves. And the docs can't enumerate every possible way in which people can hang themselves. But this happens to be a pretty common noose.

Update: Ok, this isn't the place to submit a patch, but I'd appreciate comments on a proposed patch to perlre; the addition of the following text:

The value of the special capturing variables will be undef in the event of no match within current scope, or the value of the most recent successful match's capture in the current scope even if there has been a subsequent failed match.

Is this accurate? Clear? I know that this isn't the place to submit a patch, but before I do submit one, I'd like a weigh-in on what might constitute clear and accurate verbiage.


Dave

Re: Perl's POD's description of the use of capturing special variables.
by tinita (Parson) on Apr 06, 2004 at 10:47 UTC
    i think it is probably useful to add a short code example. i don't know but code examples in the pods are likely to be taken as they are, aren't they?
    that code examples can last very long (and even longer than wanted) is shown by the cgi parsing code snippet that has been hanging around the net forever...
    so one should provide a good example, something like:
    my $value;
    if ( m/regex (with parens) .../) {
        # do something with $1
        $value = $1;
    }
    else {
        # don't use $1; exception, or provide a default value
        $value = 'default';
    }

    or maybe:

    my $value = m/...(..).../ ? $1 : 'default';
Re: Perl's POD's description of the use of capturing special variables.
by jeffa (Bishop) on Apr 06, 2004 at 13:44 UTC
    MU. Don't use $1 etc. when you don't have to:
    use strict;
    no warnings;

    my @data = (1,2,3,undef,4,5);

    for (@data) {
        /(\d+)/;
        print $1 if $1;
    }
    print "\n";

    for (@data) {
        my ($numb) = $_ =~ /(\d+)/;
        print $numb if $numb;
    }
    print "\n";
    I would like to see more of the second example in the docs. :)

    jeffa

    L-LL-L--L-LL-L--L-LL-L--
    -R--R-RR-R--R-RR-R--R-RR
    B--B--B--B--B--B--B--B--
    H---H---H---H---H---H---
    (the triplet paradiddle with high-hat)
    
      I think you meant
      /(\d+)/ and print $1;
      # or
      print $1 if /(\d+)/;
      # and
      print $numb if defined $numb; # since 0 is false
      :)
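A small sketch of why the `defined` test matters, using made-up data that includes a 0. The array names are hypothetical. Assigning the match in list context leaves `$numb` undef when the match fails, and a captured "0" is defined but false:

```perl
use strict;
use warnings;

# Hypothetical data: includes 0, an undef, and a non-numeric string.
my @data = (1, 0, undef, 'x', 25);

my (@truthy, @defined);
for my $item (@data) {
    next unless defined $item;              # skip undef input entirely

    # List-context match: $numb gets the capture, or undef on failure.
    my ($numb) = $item =~ /(\d+)/;

    push @truthy,  $numb if $numb;          # drops the legitimate 0
    push @defined, $numb if defined $numb;  # keeps it
}

print "truthy:  @truthy\n";
print "defined: @defined\n";
```

The truth test silently loses the 0, while the `defined` test keeps every value that was actually captured.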


Re: Perl's POD's description of the use of capturing special variables.
by Abigail-II (Bishop) on Apr 06, 2004 at 10:50 UTC
    What's unclear about the following in man perlre:
       NOTE: failed matches in Perl do not reset the match
       variables, which makes easier to write code that tests for a
       series of more specific cases and remembers the best
       match.
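One way to read that note, as a hedged sketch: the hypothetical `best_version` helper below tries a general pattern first and then progressively more specific ones. Because a failed match leaves $1 alone, $1 ends up holding the capture from the most specific pattern that succeeded. The version-number patterns are only an illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: return the most specific version string found.
sub best_version {
    my ($text) = @_;

    # The most general case must succeed, otherwise $1 is not ours to use.
    return unless $text =~ /(\d+)/;

    # More specific cases. Each one may fail, and a failure leaves
    # $1 holding the capture from the last pattern that did match.
    $text =~ /(\d+\.\d+)/;
    $text =~ /(\d+\.\d+\.\d+)/;

    my $best = $1;    # copy before leaving the sub's scope
    return $best;
}

print best_version('perl 5.8'),   "\n";
print best_version('perl 5.8.3'), "\n";
```

Note that the first match is still checked. The trick only works because at least one match is known to have succeeded in the current scope.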
    

    Abigail

Re: Perl's POD's description of the use of capturing special variables.
by liz (Monsignor) on Apr 06, 2004 at 10:08 UTC
    Patches are welcome, you know!

    Liz

    Update:
    Root node got changed after I made this remark. How I hate Hidden updates ;-(.

      I follow a rule I think I saw propounded by merlyn: within the first several minutes of posting, I update without extra comment. Looks like that's what happened with you and davido. If it bothers you that much, perhaps you should delay in responding to very new posts.
Re: Perl's POD's description of the use of capturing special variables.
by rinceWind (Monsignor) on Apr 06, 2004 at 10:40 UTC
    The value of the special capturing variables will be undef in the event of no match within current scope, or the value of the most recent successful match's capture in the current scope even if there has been a subsequent failed match.
    I don't think that this is particularly clear. After a failed match, will $1 be undef or hold the capture from the previous successful match? What effect does {} scope have on this?
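Both questions can be checked with a short test case. This is only a sketch with made-up strings and hypothetical variable names: a failed match leaves $1 at the previous successful capture, and a match inside a bare block stops affecting $1 once the block is exited.

```perl
use strict;
use warnings;

'outer' =~ /(\w+)/;        # succeeds: $1 is 'outer'
'outer' =~ /(\d+)/;        # fails: $1 is not reset
my $after_failure = $1;

my $inside;
{
    'inner' =~ /(\w+)/;    # succeeds inside the bare block
    $inside = $1;          # 'inner' while still in the block
}
my $after_block = $1;      # the block's match is no longer visible

print "after failure: $after_failure\n";
print "inside block:  $inside\n";
print "after block:   $after_block\n";
```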

    --
    I'm Not Just Another Perl Hacker

Re: Perl's POD's description of the use of capturing special variables.
by Vautrin (Hermit) on Apr 06, 2004 at 13:05 UTC
    While I do agree that any improvement to the documentation is good, I'd like to point out that every computer language has a number of foibles that are generally only learned by a programmer when he or she sits down and writes a test case. To give you a good example, eval blocks can be used to catch an exception (although it's something of a kludge to stop premature exiting of the program). It would seem logical that you could use an eval block to stop the "No Module in @INC" error and check whether or not you've got a Module, but you can't (Perl will exit anyways). Things like this are never learned by reading a FAQ or a book, but as a result of experience.
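On the eval example: the behaviour described is what happens, presumably, when a `use` statement is wrapped in a block eval, since `use` runs at compile time, before the eval has a chance to trap anything. A runtime `require` inside a block eval can be trapped. A minimal sketch, assuming the made-up module name below is not installed:

```perl
use strict;
use warnings;

# Hypothetical module name, assumed not to exist in @INC.
my $ok = eval { require No::Such::Module::Hopefully; 1 };

# On failure eval returns undef and $@ holds the "Can't locate ..." error.
my $error = $ok ? '' : $@;

print $ok ? "module loaded\n" : "module not available\n";
```

The program keeps running after the failed `require`, so the result can be used to fall back to another module or to skip a feature.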

Re: Perl's POD's description of the use of capturing special variables.
by marmot (Novice) on Apr 06, 2004 at 15:37 UTC
    It's not especially clear and it's an easy mistake to make. I've been using Perl for years and consider myself quite experienced with it, but this exact issue got me a couple of months ago. I had to read the text rendering of a bunch of records, which might or might not contain all fields from the database.

    while (<TEXT>) { ($f1) = /yada.+(SpecialText)anchor/ }
    worked fine when it worked, but sometimes the printout listed odd values. It was a real thump-yourself-on-the-forehead moment when I realized what was happening.

    I think a quick example in perlre would help a lot of newbies, and even the occasional old-timer.

    Dave

Re: Perl's POD's description of the use of capturing special variables.
by Not_a_Number (Prior) on Apr 06, 2004 at 22:16 UTC
    The value of the special capturing variables will be undef in the event of no match within current scope, or the value of the most recent successful match's capture in the current scope even if there has been a subsequent failed match.

    As others have pointed out, the wording here is far from optimal. My problem, however, is that it does not explain the following (but neither does the existing POD, as far as I understand it, so don't take this personally ;):

    while ( <DATA> ) {
        /^([A-Z]+)$/ or print "Line $.: No match: $1\n";
    }
    __DATA__
    FOO
    1234
    XYz

    Output:

    Line 2: No match: FOO
    Line 3: No match: F

    While for Line 2, $1 is, as expected, "the value of the most recent successful match's capture", why oh why is $1 not still 'FOO' (or undef) for Line 3??

    dave