perlguy has asked for the wisdom of the Perl Monks concerning the following question:

Wise monks:

While working on a quick script to parse some files for our site, I came across something peculiar.

In the code below, I initialize each of the three elements of my hash %date (YEAR, MONTH, and DAY) to ''. I then use embedded code in a regular expression to populate those fields.

On the first pass, everything comes out as expected, i.e. %date is populated correctly. But on every subsequent pass, %date is populated inside of the regular expression (as demonstrated), but upon leaving the regular expression, it seems to clear itself.

Rearranging @file_names makes no difference: the first iteration still works, and the remaining two still fail. Any thoughts? I'm sure I'm overlooking something.

    use strict;
    use warnings;
    use Data::Dumper;

    my @file_names = qw(
        /a/990101ag/toc.html
        /a/20000115/toc.html
        /a/990115ag/toc.html
    );

    for my $file_name ( @file_names ) {
        print "Now working on $file_name\n";
        my %date = ( YEAR => '', MONTH => '', DAY => '' );
        $file_name =~ m#
            (?<=/a/)
            (\d{2,4})(\d{2})(\d{2})(?:ag)?
            (?=/)
            (?{ @date{'YEAR', 'MONTH', 'DAY'} = ( ( length($1) == 2 ? '19' . $1 : $1), $2, $3 ) })
            (?{ print "Here, \%date is populated (joined): ", join('-', @date{'YEAR', 'MONTH', 'DAY'}), "\n" })
        #x;
        print "But after the first iteration, ",
              "it isn't populated here (joined): ",
              join('-', @date{'YEAR', 'MONTH', 'DAY'}), "\n";
        print Dumper(\%date);
        print "\n\n";
    }
    print "Why????\n";

Replies are listed 'Best First'.
Re: regex code embedding problem?
by TimToady (Parson) on Mar 13, 2004 at 09:25 UTC
    I traced the problem down to the point of finding that it was the pp_unstack code at the end of the loop that is breaking the closureness of the regex block. Here's a minimal test case using the regex:
        for my $str ( 1..3 ) {
            my $lex = "";
            $str =~ m#
                (\d)
                (?{ $lex = $1; print " In: $lex\n" })
            #x;
            print "Out: $lex\n";
        }
    But it turns out you don't even need a regex to trigger the bug. Here's the same thing with a regular closure:
        for $i ( 1..3 ) {
            my $lex = "";
            $code = sub {
                $lex = "Test $i";
                print " In: $lex\n";
            } unless $code;
            $code->();
            print "Out: $lex\n";
        }
    The basic underlying problem is that when a lexical variable goes out of scope, the unstack code tries to "null out" any lexicals in its scope. But it can't do that if someone's trying to return the lexical as a result. In that case, it cuts it loose and cooks up a new lexical. Unfortunately, that isn't the case here, but it thinks it is, so it cuts it loose from the inner scope rather than from any outer scope.

    Ordinarily closures don't run into this problem because they reclone their symbol table each time through the loop. But as you see in the last example, if we suppress the recloning with an unless, we get the same bug.

    Actually, it's arguably not a bug in the second case, because we've taken the closure the first time through the loop, and if its bindings refer to the my variables of the first iteration, they can't also refer to the variables of the other iterations, presuming those to be "different" my variables.
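
    For contrast, here is a minimal sketch of the ordinary case described above, where the closure is rebuilt on every pass instead of being kept by an "unless" guard. It is not from the original thread: run_closures is a hypothetical name, and the example only illustrates the rebinding, not the regex engine's behavior.

    ```perl
    use strict;
    use warnings;

    # Hypothetical helper: builds a fresh closure on every pass, so each
    # closure binds the $lex that belongs to that pass.
    sub run_closures {
        my @out;
        for my $i ( 1 .. 3 ) {
            my $lex = "";
            my $code = sub { $lex = "Test $i" };   # no "unless $code" guard
            $code->();
            push @out, $lex;                        # the value seen outside the closure
        }
        return @out;
    }

    print "Out: $_\n" for run_closures();
    ```

    Because the sub is created anew each time through the loop, the assignment inside it and the read outside it refer to the same $lex on every pass.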

    So probably what needs to happen is that when the sv_compile_2op() routine is compiling the insides of (?{...}), it needs to allocate and save a real CV rather than just the opcodes, so that when the regex engine gets around to running the closure, it can actually be made a closure by calling cv_clone().

    Oh, and my mail server is down right now, so if someone could forward this to perlbug, I'd be much obliged.

Re: regex code embedding problem?
by kvale (Monsignor) on Mar 12, 2004 at 23:26 UTC
    Well, I can get it to work by bringing the lexical scope outside the loop:
        use strict;
        use warnings;
        use Data::Dumper;

        my @file_names = qw(
            /a/990101ag/toc.html
            /a/20000115/toc.html
            /a/990115ag/toc.html
        );

        my %date;
        for my $file_name ( @file_names ) {
            print "Now working on $file_name\n";
            %date = ( YEAR => '', MONTH => '', DAY => '' );
            $file_name =~ m#
                (?<=/a/)
                (\d{2,4})(\d{2})(\d{2})(?:ag)?
                (?=/)
                (?{ @date{'YEAR', 'MONTH', 'DAY'} = ( ( length($1) == 2 ? '19' . $1 : $1), $2, $3 ) })
                (?{ print "Here, \%date is populated (joined): ", join('-', @date{'YEAR', 'MONTH', 'DAY'}), "\n" })
            #x;
            print "But after the first iteration, ",
                  "it isn't populated here (joined): ",
                  join('-', @date{'YEAR', 'MONTH', 'DAY'}), "\n";
            print Dumper(\%date);
            print "\n\n";
        }
    This looks like some unusual interaction between lexicals and the regex engine.

    Update: cleaned up an initialization.

    -Mark

      Thanks, kvale,

      I'm going to agree that this is something strange with lexicals and the regular expression engine. I tried bringing the lexical outside of the loop with success, as you suggested, but I thought the result was quite interesting upon putting it in the same scope as the regular expression. A mystery to me...

Re: regex code embedding problem?
by TilRMan (Friar) on Mar 13, 2004 at 03:07 UTC

    My best guess is that when your regexp gets compiled -- the first time it is run -- the "first" %date gets compiled in. Perl thinks that since the compiled regexp hasn't changed, it can just rerun the regexp, but instead it populates the wrong %date.

    Put the $file_name =~ ... inside an eval "" and it'll do what you expect it to do.
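
    A minimal sketch of that suggestion, applied to the reduced test case rather than the full script. The name match_with_eval is hypothetical, and the premise that a string eval recompiles the match (and its code block) on every pass is the assumption the workaround rests on.

    ```perl
    use strict;
    use warnings;

    # Hypothetical helper: wraps the match in a string eval so that the
    # pattern and its embedded code block are compiled afresh on each pass.
    sub match_with_eval {
        my @out;
        for my $str ( 1 .. 3 ) {
            my $lex = "";
            eval q{ $str =~ m/ (\d) (?{ $lex = $1 }) /x; 1 } or die $@;
            push @out, $lex;    # value of $lex after the match has finished
        }
        return @out;
    }

    print "Out: $_\n" for match_with_eval();
    ```

    The cost is a recompile per iteration, which hardly matters for a run-once script but would in a hot loop.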

    -- 
    LP^>

Re: regex code embedding problem?
by matija (Priest) on Mar 12, 2004 at 23:28 UTC
    I'm not quite sure why you feel you need to execute the code inside the regex. perldoc perlre says that code is highly experimental.

    I'd do it like this:

        if ( $file_name =~ m# (?<=/a/) (\d{2,4})(\d{2})(\d{2})(?:ag)? #x ) {
            @date{'YEAR', 'MONTH', 'DAY'} = ( ( length($1) == 2 ? '19' . $1 : $1 ), $2, $3 );
        }
    Am I overlooking some reason why that wouldn't work for you?
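
    A variant sketch of the same idea that takes the captures from a list-context match instead of $1, $2, $3. The helper name parse_date is hypothetical, and keeping the (?=/) lookahead from the original post is an assumption about what is wanted.

    ```perl
    use strict;
    use warnings;

    # Hypothetical helper: returns a hash ref with YEAR, MONTH and DAY,
    # left as empty strings when the path does not match.
    sub parse_date {
        my ($file_name) = @_;
        my %date = ( YEAR => '', MONTH => '', DAY => '' );

        # In list context the match returns the captured groups directly.
        if ( my ( $y, $m, $d ) =
                $file_name =~ m# (?<=/a/) (\d{2,4})(\d{2})(\d{2})(?:ag)? (?=/) #x )
        {
            # Two-digit years are assumed to be 19xx, as in the original post.
            @date{qw(YEAR MONTH DAY)} = ( ( length($y) == 2 ? "19$y" : $y ), $m, $d );
        }
        return \%date;
    }

    for my $file_name (qw( /a/990101ag/toc.html /a/20000115/toc.html /a/990115ag/toc.html )) {
        my $date = parse_date($file_name);
        print join( '-', @{$date}{qw(YEAR MONTH DAY)} ), "\n";
    }
    ```

    Nothing here runs inside the regex, so the lexical %date is only ever touched by ordinary code in its own scope.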

      Well, for one, I wasn't seeking a true/false value from the regular expression; I'm simply using the engine to do some dirty work. The question I have really hasn't been answered even if the code you posted works. I'd like to know why the hash is cleared upon exiting the regex, when it is quite obviously (or so I think) populated upon successful matching of the lookbehind, matching pattern, and then lookahead.

      I know that embedded code in a regex is experimental. The code I'm writing is write once, execute once, and then never look at again. I generally don't write code like this, especially not in production, but when I came across what I felt was something strange, I simplified the code down to what I felt to be the simplest case that generated the problem, and then asked about it.