bytex64 has asked for the wisdom of the Perl Monks concerning the following question:

I've got a rather... bizarre problem involving regex embedded code. I have a program that finds files, and matches them to different rules, which do various things to the files. The regexes are specified with a simple syntax and expanded into full-blown regexes later. The regex is designed to not only match the filenames, but also to store important parts of the string for later (for use in constructing an output filename, for example). Here's the important bit of code:

    #!/usr/bin/perl -w
    use strict;
    use re 'eval';

    my @strings = ('test_001022','test_585381','test_389742','test_330104');
    my $capture = 'test_%a%b';
    my %matchers = ('a' => '(\d{2})(?{ $mv{a} = $^N })',
                    'b' => '(\d{4})(?{ $mv{b} = $^N })'
                   );

    sub addmatchers {
        my ($regex) = @_;
        $regex =~ s/%(.)/$matchers{$1}/g;
        return $regex;
    }

    sub domatch {
        my ($str, $regex) = @_;
        my %mv;
        if ($str =~ /$regex/) {
            if (keys %mv == 0) {
                print "No keys in hash.\n";
                return 0;
            } else {
                return \%mv;
            }
        } else {
            return 0;
        }
    }

    print "original pseudo-regex: $capture\n";
    my $r = addmatchers($capture);
    print "modified regex: $r\n\n";

    foreach my $str (@strings) {
        print "Matching on $str\n";
        my $result = domatch($str,$r);
        if ($result) {
            print "Result:\n";
            foreach my $k (keys %{$result}) {
                print "\t$k => $result->{$k}\n";
            }
        } else {
            print "No match.\n";
        }
    }

The addmatchers() function turns $capture into a regex by substituting in specialized matchers that match a bit of text and store it in a hash. This regex is then run with domatch(), matches are stored in the local %mv hash, and the resulting hash is returned if the regex matches.
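To make the expansion step concrete, here is a minimal sketch of just the substitution that addmatchers() performs, reusing the %matchers table from the code above. Nothing is matched here; the script only builds and prints the pattern string, so it does not need use re 'eval'.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Same placeholder table as in the question. The single quotes keep
# $mv{a} and $^N as literal text, to be run later by the regex engine.
my %matchers = (
    'a' => '(\d{2})(?{ $mv{a} = $^N })',
    'b' => '(\d{4})(?{ $mv{b} = $^N })',
);

# Replace each %x placeholder with the matcher fragment stored under x.
sub addmatchers {
    my ($regex) = @_;
    $regex =~ s/%(.)/$matchers{$1}/g;
    return $regex;
}

my $expanded = addmatchers('test_%a%b');
print "$expanded\n";
```

The printed string is the "modified regex" shown in the example run. A placeholder with no entry in %matchers would expand to nothing (with a warning), so a real program would probably want to check for that.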

The problem is that only the first match seems to work. For subsequent strings, the regex matches successfully, but the hash comes back empty. Here's an example run:

    original pseudo-regex: test_%a%b
    modified regex: test_(\d{2})(?{ $mv{a} = $^N })(\d{4})(?{ $mv{b} = $^N })

    Matching on test_001022
    Result:
            a => 00
            b => 1022
    Matching on test_585381
    No keys in hash.
    No match.
    Matching on test_389742
    No keys in hash.
    No match.
    Matching on test_330104
    No keys in hash.
    No match.

My first thought was "I've found a bug in perl!", but now I think there's some bizarre regex backtrack scoping... thing... going on. I've tried this with Perl versions 5.8.4 (Slackware 10.0), 5.8.6 (Fedora Core 4), and 5.8.7 (Debian unstable) with identical results.

Re: regex eval capture weirdness
by Errto (Vicar) on Nov 30, 2005 at 23:31 UTC

    Yes, this appears to be a scoping problem. I'm not fully versed in what is going on, but based on similar issues I've seen in the past, my safest suggestion is to turn %mv into a global (our) variable, localize it on entry to domatch, and then return a reference to a copy of it, as in return { %mv }.
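A minimal sketch of that suggestion, assuming the expanded pattern from the original post and a cut-down domatch(); the two test strings are taken from the question. It combines all three pieces: a package variable, local on entry, and a copy on return.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use re 'eval';   # required: the pattern string contains (?{ ... }) blocks

our %mv;         # package variable, looked up by name from the (?{ ... }) blocks

sub domatch {
    my ($str, $regex) = @_;
    local %mv;                         # each call starts with an empty hash
    return 0 unless $str =~ /$regex/;
    return 0 unless keys %mv;
    return { %mv };                    # fresh copy, independent of %mv
}

my $r = 'test_(\d{2})(?{ $mv{a} = $^N })(\d{4})(?{ $mv{b} = $^N })';

# Collect every result first, then print, so stale data would be visible.
my @results = map { domatch($_, $r) } qw(test_001022 test_585381);
for my $res (@results) {
    print join(', ', map { "$_ => $res->{$_}" } sort keys %$res), "\n";
}
```

Collecting the results before printing is deliberate: it shows that each returned hash keeps its own values rather than sharing one underlying hash.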

    I think what's happening is that because the regular expression doesn't change, it only gets compiled once and thus the %mv within those ?{} blocks gets bound to the lexical %mv that's created the first time domatch gets called. To see if I'm right, try saving the reference to the return value from domatch only the first time it's called but printing the contents of it after each iteration.

    If I'm right, this is somehow related to the "won't stay shared" warning though I admit I don't have a solid grasp on why.

Re: regex eval capture weirdness
by injunjoel (Priest) on Nov 30, 2005 at 23:37 UTC
    Greetings,
    Just some updates to your code. I declared %mv in the main scope as "our" so it can be localized in your &domatch() sub later on. The code below works fine for me.
        #!/usr/bin/perl -w
        use strict;
        use re 'eval';

        my @strings = ('test_001022','test_585381','test_389742','test_330104');
        my $capture = 'test_%a%b';
        #added our to your %mv declaration so it can be localized in &domatch();
        our %mv;
        my %matchers = ('a' => '(\d{2})(?{ $mv{a} = $^N })',
                        'b' => '(\d{4})(?{ $mv{b} = $^N })'
                       );

        sub addmatchers {
            my ($regex) = @_;
            $regex =~ s/%(.)/$matchers{$1}/g;
            return $regex;
        }

        sub domatch {
            my ($str, $regex) = @_;
            #localize %mv
            local %mv;
            if ($str =~ /$regex/) {
                if (keys %mv == 0) {
                    print "No keys in hash.\n";
                    return 0;
                } else {
                    return \%mv;
                }
            } else {
                return 0;
            }
        }

        print "original pseudo-regex: $capture\n";
        my $r = addmatchers($capture);
        print "modified regex: $r\n\n";

        foreach my $str (@strings) {
            print "Matching on $str\n";
            my $result = domatch($str,$r);
            if ($result) {
                print "Result:\n";
                foreach my $k (keys %{$result}) {
                    print "\t$k => $result->{$k}\n";
                }
            } else {
                print "No match. ($r)\n";
            }
        }

    Here is the output I get
        original pseudo-regex: test_%a%b
        modified regex: test_(\d{2})(?{ $mv{a} = $^N })(\d{4})(?{ $mv{b} = $^N })

        Matching on test_001022
        Result:
                a => 00
                b => 1022
        Matching on test_585381
        Result:
                a => 58
                b => 5381
        Matching on test_389742
        Result:
                a => 38
                b => 9742
        Matching on test_330104
        Result:
                a => 33
                b => 0104


    -InjunJoel
    "I do not feel obliged to believe that the same God who endowed us with sense, reason and intellect has intended us to forego their use." -Galileo
Re: regex eval capture weirdness
by GrandFather (Saint) on Nov 30, 2005 at 23:39 UTC

    It's a closure problem (which I don't understand well). The fix looks like:

        sub domatch {
            my ($str, $regex) = @_;
            our %mv; # Note the our

            return 0 if $str !~ /$regex/;
            return \%mv if keys %mv != 0;

            print "No keys in hash.\n";
            return 0;
        }

    Which in the context of your sample code prints:

        original pseudo-regex: test_%a%b
        modified regex: test_(\d{2})(?{$mv{a} = $^N })(\d{4})(?{$mv{b} = $^N })

        Matching on test_001022
        Result:
                a => 00
                b => 1022
        Matching on test_585381
        Result:
                a => 58
                b => 5381
        Matching on test_389742
        Result:
                a => 38
                b => 9742
        Matching on test_330104
        Result:
                a => 33
                b => 0104

    Neat trick with the (?{...$^N}) BTW.


    DWIM is Perl's answer to Gödel

      GrandFather, minor quibble. Take the output section and change it to:

          my @results = map { [$_ => domatch($_, $r)] } @strings;
          #foreach my $str (@strings) {
          foreach my $r (@results) {
              my ($str, $result) = @$r;
              print "Matching on $str\n";
              #my $result = domatch($str,$r);
              if ($result) {
                  print "Result:\n";
                  foreach my $k (keys %{$result}) {
                      print "\t$k => $result->{$k}\n";
                  }
              } else {
                  print "No match.\n";
              }
          }
      And notice it stop working. Well, notice it give the last result for all the matches. Now change your code to return { %mv } if keys %mv != 0 and notice it start working again.
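The difference between returning \%mv and { %mv } can be seen without any regex at all. The sketch below is an illustration only: by_ref() and by_copy() are hypothetical helpers that stand in for domatch(), each storing a value in a shared package hash before returning it.

```perl
#!/usr/bin/perl
use strict;
use warnings;

our %mv;   # one shared package hash, as in the "our %mv" version of domatch()

# Hypothetical stand-in that returns a reference to the shared hash itself.
sub by_ref {
    my ($value) = @_;
    $mv{a} = $value;
    return \%mv;
}

# Hypothetical stand-in that returns a reference to a fresh copy.
sub by_copy {
    my ($value) = @_;
    $mv{a} = $value;
    return { %mv };
}

my @refs   = map { by_ref($_) }  qw(00 58 38);
my @copies = map { by_copy($_) } qw(00 58 38);

print "by reference: ", join(' ', map { $_->{a} } @refs),   "\n";
print "by copy:      ", join(' ', map { $_->{a} } @copies), "\n";
```

Every element of @refs points at the same hash, so all three show the last value stored (38 38 38), while @copies keeps 00 58 38. That matches the "last result for all the matches" symptom described above.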

      It is bizarre ... and I'm hoping someone more familiar with the regexp engine internals will pipe up as to why this isn't DWIMming very well. I tried a number of different changes to try to get the engine to think I was doing another match, but failed; perhaps someone else will see what we're missing. For that reason, I wouldn't (yet) call what you did a "fix" but more of a "workaround" ;-)

        Yes, "fix" was overstating the case somewhat in light of the level of understanding involved.

        Take a look at this discussion. It's the same problem. The issue is that the compilation of the (?{...}) in the regex creates a closure on %mv, and the second time through the sub the my %mv; creates a different instance of %mv, which is not the one the regex is using.
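The same effect can be reproduced with an ordinary closure, which may make the explanation easier to picture. This is an analogy under stated assumptions, not a description of the regex engine's internals: $setter and count_keys() are hypothetical names, and the anonymous sub plays the role of a (?{...}) block that is compiled only once.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $setter;   # created once, like a pattern that is compiled only once

sub count_keys {
    my ($value) = @_;
    my %mv;                               # a new lexical hash on every call
    $setter ||= sub { $mv{a} = $_[0] };   # closes over the FIRST call's %mv only
    $setter->($value);                    # always writes to that first hash
    return scalar keys %mv;               # counts keys in THIS call's hash
}

my @counts = map { count_keys($_) } qw(00 58 38);
print "@counts\n";
```

This prints 1 0 0: the first call sees its key, and every later call gets an empty hash because the closure is still writing to the hash from the first call. A package variable declared with our is looked up by name rather than captured, which is why that version keeps working.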

        Now I've told the teddy bear I understand the problem and why our works - at least until I get back to my desk. :)



      Ah, very nice. That works great. Thanks for everyone's help, I've been banging my head against the wall over this for days. :)