Rudif has asked for the wisdom of the Perl Monks concerning the following question:

The other day I had the following problem: scan a bunch of idl files that look, simplified, like __DATA__ below, to identify each interface or dispinterface declaration, and see whether it has the attribute 'hidden' present, absent or commented out.

I ended up with the solution shown below, after much head scratching and searching through Perl doc and books. In my early attempts I did not have the two inner blocks - all matches were done in the for loop block.

Naively I expected the $1 variable to be reinitialized in each attempted match. Not so. When the first match succeeded (found 'interface Isome') and the second failed (did not find 'hidden'), $1 still contained the 'interface' string from the first match, while I wanted it to be 'undef'.
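A minimal sketch of that surprise, using made-up strings rather than the IDL data (the variable names are only for illustration):

```perl
use strict;
use warnings;

my @seen;

# First match succeeds and sets $1.
'interface ISome' =~ /(interface)/;
push @seen, $1;

# Second match fails, so $1 is not reset; it still holds 'interface'.
my $ok = 'uuid(...)' =~ /(hidden)/;
push @seen, ($ok ? 'matched' : 'failed'), $1;

print "@seen\n";
```

Both matches sit in the same scope here, which is exactly the situation the early attempts were in.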

Eventually I found these two perls of wisdom that saved my day:

perlre
The numbered variables ($1, $2, $3, etc.) and the related punctuation set ($+, $&, $`, and $') are all dynamically scoped until the end of the enclosing block or until the next successful match, whichever comes first. (See Compound Statements in the perlsyn manpage.)


Effective Perl Programming, item 16
Memory variables are automatically localized by each new scope. In a unique twist, the localized variables receive copies of the values from the outer scope - this is in contrast to the usual reinitializing of a localized variable


In addition, I realized that a 'next' without a label takes me out of the immediate enclosing block, but it needs a label, OUTER:, to get me out of the for loop.
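A small sketch of the difference, assuming nothing beyond core Perl (the @log array and the loop values are invented for the example):

```perl
use strict;
use warnings;

my @log;

OUTER:
for my $i (1 .. 3) {
    {
        # A bare block counts as a loop that runs once, so an unlabeled
        # 'next' only leaves this block.
        next if $i == 2;
        push @log, "inner $i";
    }
    push @log, "after-bare $i";    # still reached when $i == 2

    {
        # With the label, 'next' skips the rest of the for iteration.
        next OUTER if $i == 2;
        push @log, "labeled $i";
    }
    push @log, "end $i";
}

print join(', ', @log), "\n";
```

For $i == 2 only "after-bare 2" is logged: the unlabeled next left the first bare block, and the labeled one abandoned the iteration.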

Moral for a regex user:
if you have several successive match operations that use the numbered variables, the safe course is to isolate the matches in separate blocks on the same level (isolating a match in an inner block from a match in an outer block would not work because of that 'unique twist').
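The sibling-versus-nested distinction can be sketched like this (strings and variable names are made up for the illustration):

```perl
use strict;
use warnings;

my ($sibling, $nested);

{   # block 1: a successful match sets $1 for this block only
    'interface IMgr' =~ /(interface)/;
}
{   # block 2, a sibling: the $1 from block 1 was restored on exit,
    # so a failed match here leaves $1 undefined
    'object' =~ /(hidden)/;
    $sibling = defined $1 ? $1 : 'undef';
}

{   # outer block
    'interface IMgr' =~ /(interface)/;
    {   # the inner block inherits the outer $1; a failed match keeps it
        'object' =~ /(hidden)/;
        $nested = defined $1 ? $1 : 'undef';
    }
}

print "sibling: $sibling, nested: $nested\n";
```

This assumes no successful match has happened earlier in the enclosing scope; if one had, the sibling block would see that older $1 instead of undef.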

Questions to the wise:
Why do the localized variables receive copies of the values from the outer scope? Why are the variables not simply undef'd, like in the regular local operation? Since these variables are read-only, there is no way to undef them and thus erase the memory of a previous match.
Are there other tricks or techniques to get around this problem?

Rudif

#! perl -w
use strict;

my $text;
{
    $/ = undef;
    $text = <DATA>;
}
my @f = split /(?<=\n)([ \t]*[\[\]][ \t]*\n)/s, $text;

OUTER:
for my $i (0..$#f-2) {
    {   # inner block 1
        next OUTER unless
            $f[$i]   =~ /^([ \t]*[\[][ \t]*)\n$/ &&
            $f[$i+2] =~ /^([ \t]*[\]][ \t]*)\n$/ &&
            $f[$i+3] =~ /((?:disp)?interface\s+\w+)/;
        print STDERR "==$i== $1\n";
    }
    {   # inner block 2
        $f[$i+1] =~ /(?<=\n)((\s*[\/]*\s*not)?\s+hidden)/;
        my $h = defined $1 ? $1 : ' HIDDEN UNDEF';
        print STDERR "==$i== $h\n";
    }
}
__DATA__
[
    uuid(078F04FD-B23E-11D3-80C3-00A024D42DAF),
    // not hidden
]
dispinterface _IMgrEvents
{
};
[
    object,
    uuid(078F04EB-B23E-11D3-80C3-00A024D42DAF),
    hidden
]
interface IMgr : IDispatch
{
    [propget, id(201), HRESULT DebugInfo([out, retval] BSTR *pVal);
};
[
    uuid(078F04FE-B23E-11D3-80C3-00A024D42DAF),
]
dispinterface _IMgrEvents
{
};
[
    object,
    uuid(078F04EC-B23E-11D3-80C3-00A024D42DAF),
]
interface IMgr : IDispatch
{
    [propget, id(201), HRESULT DebugInfo([out, retval] BSTR *pVal);
};

Replies are listed 'Best First'.
Re: Scoping the regex memory variables and where do I go next
by japhy (Canon) on Mar 01, 2001 at 05:07 UTC
    What really irks me is that the only way to undefine them (locally) is to do something like:

    "jeffrey" =~ /(.*?)(.)\2(.*)/;
    print "$1 + $2 + $2 + $3 = $1$2$2$3\n";
    {
        local ($1,$2,$3);
        print "$1 + $2 + $2 + $3 = $1$2$2$3\n";
        "" =~ /(?=)/;
        print "$1 + $2 + $2 + $3 = $1$2$2$3\n";
    }

    The output is:

    je + f + f + rey = jeffrey
    je + f + f + rey = jeffrey
     +  +  +  = 

    Sigh. And you can't do a blanket localization -- that's the worst part; I have to know how many $DIGIT variables are in use, so I know how many to localize. That sucks.

    japhy -- Perl and Regex Hacker
Re: Scoping the regex memory variables and where do I go next
by tadman (Prior) on Mar 01, 2001 at 04:58 UTC
    If you are asking "Why do the localized variables receive copies of the values from the outer scope?" I would suggest it is because of common situations like:

    if (/\$(\d+\.\d+)/) {
        if ($1 > 1) {
            $donut_price = $1;
        }
    }

    So, if $1 mysteriously "disappeared" by the time you tried to make an assignment, what use would $1 be?

    Your use of a split(), however, is a scary sight to behold. You would probably be better off using a regexp on its own, and checking if the regexp passes or fails each time. This would avoid using $1, and would avoid the scoping problem. Use while{} instead of split.

    Avoiding $1 is easy if you assign the match to a list:

    ($foo) = /...(\d+).../;

    If the match passes, but some of your memorizations are conditional (i.e. /()?/), then some of the list members might be undef, which is what you'd expect. It is still a true assignment, though, so the while will proceed as planned.
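    A minimal sketch of the list-assignment idea, with an optional capture group. The sample lines and the names $kind, $name and $base are invented for the illustration, and it uses if rather than while to keep it short:

```perl
use strict;
use warnings;

my @report;
for my $line ('interface IMgr : IDispatch', 'dispinterface _IMgrEvents', '};') {
    # A list assignment in boolean context is true when the match returns
    # at least one element, i.e. when the match succeeded, even if an
    # optional group did not take part and came back as undef.
    if (my ($kind, $name, $base) =
            $line =~ /^((?:disp)?interface)\s+(\w+)(?:\s*:\s*(\w+))?/) {
        push @report, "$kind $name " . (defined $base ? "($base)" : '(no base)');
    }
    else {
        push @report, 'no match';
    }
}

print "$_\n" for @report;
```

    The captured values live in ordinary lexicals, so a later failed match cannot leave stale data behind.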
Re: Scoping the regex memory variables and where do I go next
by Rudif (Hermit) on Mar 01, 2001 at 05:10 UTC
    Just a few hours ago, merlyn answered my question "Are there other tricks or techniques to get around this problem?" in another thread :

    Another way is to stylistically outlaw all uses of $1 et seq, except in the right side of a substitution. Any other "capturing" should be done as list-context assignment:
    my ($first, $second) = $source =~ /blah(this)blah(that)blah/;
    Then it's very clear what the scope and origination of $first and $second are.


    But I'd still like to know: is that 'unique twist' a feature? merlyn?
    Thanks!

    Rudif
      It's a documented feature of m//. Also, the $DIGIT variables are read-only, so storing their values in another variable allows you to modify them.

      japhy -- Perl and Regex Hacker
Re: Scoping the regex memory variables and where do I go next
by chipmunk (Parson) on Mar 01, 2001 at 08:38 UTC
    There's a very simple rule, which you are violating.

    Don't rely on the values of $1 et al. if you don't know that the regex matched.

    Here's a very simple example:

    /(a)/;
    print "matched $1\n";        # WRONG!

    if (/(a)/) {
        print "matched $1\n";    # Right
    }
Re: Scoping the regex memory variables and where do I go next
by Rudif (Hermit) on Mar 02, 2001 at 03:34 UTC
    Thank you all for the words of wisdom.

    As tadman points out, it makes sense that the variables $1 etc. keep their values in inner scopes - I might need them! However, if local($1,$2,$3) did exactly what local($x,$y,$z) does, namely save the outer scope values and undefine the localized ones, the behavior would be simpler to understand and to control. japhy (http://www.perlmonks.org/index.pl?node_id=61482&lastnode_id=61478) would probably agree.

    Lessons learned:
    • Don't rely on the values of $1 etc except in very simple situations (e.g. a single match in the scope)
    • Save the contents of $1 etc. in your own named variables, possibly using the list assignment

    "Your use of a split(), however, is a scary sight to behold," tadman writes.
    OK, it may be inefficient, but it separates the concerns neatly. Split produces a list in which separator-strings and blocks-between-separators alternate. I can confidently walk down the list and examine and/or modify each block without worrying at the same time about where the block ends. Assuming, of course, that the regex used in split correctly identifies all separator-strings in the text.
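    A small sketch of that alternation, reusing the split regex from the original script on a shortened, made-up input:

```perl
use strict;
use warnings;

my $text = "a\n[\nuuid\n]\ninterface I\n";

# Because the separator pattern is wrapped in capturing parentheses,
# split returns the separators as well as the pieces between them.
my @f = split /(?<=\n)([ \t]*[\[\]][ \t]*\n)/, $text;

for my $i (0 .. $#f) {
    (my $shown = $f[$i]) =~ s/\n/\\n/g;    # make newlines visible
    printf "%d %-9s %s\n", $i, ($i % 2 ? 'separator' : 'block'), $shown;
}
```

    Even indexes hold the blocks and odd indexes hold the bracket-only separator lines, which is what lets the loop address $f[$i], $f[$i+1] and so on by position.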

    Rudif