davido has asked for the wisdom of the Perl Monks concerning the following question:

This all started out as just a little proof-of-concept for my personal amusement. I set out to create a regexp that uses the (?{....}) construct to parse a string of bits of arbitrary length, and return their decimal value. This sort of thing already exists with unpack and vec, but curiosity prevailed, and I just wanted to see what a regexp approach would look like in the end. Ultimately, the regexp engine isn't accomplishing much aside from iterating over each character in the bit-string. Plain old Perl code within the (?{....}) brackets is doing the work that might just as easily be done outside of the regexp. But that notwithstanding, it was entertaining to tinker with.
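For comparison, here is a minimal sketch of the non-regexp route mentioned above. The helper names bin_to_dec_unpack and bin_to_dec_oct are hypothetical and not part of the original post. The unpack version assumes a bit string of at most 32 digits; the oct version relies on oct() accepting a "0b" prefix and is limited by the platform's integer width.

```perl
use strict;
use warnings;

# Hypothetical helper: convert via pack/unpack, limited to 32 bits.
sub bin_to_dec_unpack {
    my $bits = shift;
    die "$bits is not a pure bit string.\n" if $bits =~ m/[^10]/;
    die "$bits is longer than 32 bits.\n"   if length($bits) > 32;

    # Left-pad with zeros to exactly 32 digits, pack them as a bit string
    # (high-order bit first), then read the 4 bytes back as an unsigned
    # big-endian 32-bit integer.
    my $padded = substr( ( '0' x 32 ) . $bits, -32 );
    return unpack( 'N', pack( 'B32', $padded ) );
}

# Hypothetical helper: oct() interprets a leading "0b" as binary.
sub bin_to_dec_oct {
    my $bits = shift;
    die "$bits is not a pure bit string.\n" if $bits =~ m/[^10]/;
    return oct("0b$bits");
}

print bin_to_dec_unpack('1101100'), "\n";   # 108
print bin_to_dec_oct('1101100'),    "\n";   # 108
```

Neither version handles the arbitrary-length strings that the regexp experiment is aiming for.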

The gadget works great... under some conditions. It is the conditions that fail which have me perplexed to the point of needing to post this SoPW. First I'll present a working example:

use strict;
use warnings;

print bin_to_dec('1101100');

sub bin_to_dec {
    my $bits = shift;
    my( $power, $magnitude, $num );
    die "$bits is not a pure bit string.\n" if $bits =~ m/[^10]/;
    if ( $bits =~ m/
            (?{ $power = length($_) - 1; })
            (?:
                ([10])
                (?{
                    $magnitude = 2 ** $power;
                    $^N eq '1' and $num += $magnitude;
                    $power--;
                })
            )+
        /x
    ) {
        return $num;
    }
    else {
        die "Unable to resolve bits: $bits.\n";
    }
}

The output is '108', as you would expect, assuming the high-order bit is at the left. And this subroutine (which is laboriously explicit in the spirit of providing a clear, easy-to-understand snippet) works great for any bit string from one digit to over a thousand binary digits.

But look what happens when I read from the filehandle <DATA> to test a series of binary strings.

use strict;
use warnings;

while ( <DATA> ) {
    chomp;
    print bin_to_dec($_), "\n";
}

sub bin_to_dec {
    my $bits = shift;
    my( $power, $magnitude, $num );
    die "$bits is not a pure bit string.\n" if $bits =~ m/[^10]/;
    if ( $bits =~ m/
            (?{ $power = length($_) - 1; $num = 0; })
            (?:
                ([10])
                (?{
                    $magnitude = 2 ** $power;
                    $^N eq '1' and $num += $magnitude;
                    $power--;
                })
            )+
        /x
    ) {
        return $num;
    }
    else {
        die "Unable to resolve bits: $bits.\n";
    }
}

__DATA__
00000000
00000011
00000111
11100000
__OUTPUT__
0
Use of uninitialized value in print at test.pl line 9, <DATA> line 2.
Use of uninitialized value in print at test.pl line 9, <DATA> line 3.
Use of uninitialized value in print at test.pl line 9, <DATA> line 4.

I have tried to isolate the quirk by putting a print statement for $num inside the (?{....}) construct, and as I hoped, $num does get the appropriate value there. But when I put a print "$num\n"; just before the subroutine's return, $num has no value, at least in the second snippet. In the first snippet there is no problem.

I have also tried using $num (and the other variables used within the regexp) as package globals, with our, as well as with use vars, thinking that maybe lexical scoping was causing my pain. In so doing, I declared those variables at the top of the script to give them the broadest possible exposure. Again, no change; the second snippet fails, and the first snippet works great.

So I turn to you guys to see if anyone else can confirm or deny this funky behavior. I'm using ActiveState Perl 5.8.4 on WinXP.


Dave

Replies are listed 'Best First'.
Re: Inexplicable uninitialized value when using (?{...}) regexp construct.
by blokhead (Monsignor) on Sep 28, 2004 at 19:01 UTC
    Changing the $num variable to a package variable fixes the problem for me as well. Here's what seems to be happening:
    • The first call to bin_to_dec compiles the regular expression. The (?{code}) block is compiled too, and the name $num is bound to the location of the lexical variable $num within the sub's scope.
    • Any subsequent call to the bin_to_dec sub creates a new $num variable because of the my statement. But the regular expression is the same one from before (not recompiled), so the name $num is still bound to the lexical from the previous pad!
    You can verify this by printing $num from within the regular expression. It keeps getting updated correctly, but it's not the same $num that the return statement sees. Anyway, making it a package variable with local instead of my clearly fixes it.

    Also, a minor nit: An easier way to read in binary numbers left-to-right is:

    m/
        (?{ $num = 0; })
        (?:
            ([10])
            (?{ $num = 2*$num + $^N; })
        )+
    /x

    i.e., shift left (which in binary multiplies by two), and carry in the new bit. This extends to any base representation (just exchange 2 for the base). You never have to know the length in advance. As a former TA for a models of computation class, I couldn't let that one slide ;)
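The shift-and-carry idea can also be sketched outside the regexp engine. This is a minimal illustration, not code from the thread: the name digits_to_num is hypothetical, and the sketch assumes the digits are the characters 0 through 9, so it only covers bases up to 10.

```perl
use strict;
use warnings;

# Hypothetical helper: read digits left to right, multiplying the running
# total by the base and adding the new digit at each step.
sub digits_to_num {
    my ( $str, $base ) = @_;
    die "'$str' is not a string of decimal digits.\n" if $str !~ m/^[0-9]+$/;

    my $num = 0;
    for my $digit ( split //, $str ) {
        die "Digit $digit is out of range for base $base.\n"
            if $digit >= $base;
        $num = $base * $num + $digit;
    }
    return $num;
}

print digits_to_num( '1101100', 2 ), "\n";   # 108
print digits_to_num( '777',     8 ), "\n";   # 511
```

As in the regexp version, the length of the string never has to be known in advance.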

    blokhead

      Lol, you've got to love a refresher in middle-school math. ;) Thanks.

      Taking that into consideration, and adding a built-in string validity check, here's how the working code now looks:

      use strict;
      use warnings;
      use Carp;

      while ( <DATA> ) {
          chomp;
          print "$_\t=>\t", bin_to_dec($_), "\n";
      }

      sub bin_to_dec {
          our $num;      # $num must be a global to work.
          local $num;    # Play nice if $num is already in global use.
          $_[0] =~ m/
              (?=^[10]+$)    # Check for a valid string.
              (?{ $num = 0; })
              (?:
                  ([10])
                  (?{ $num = 2 * $num + $^N; })
              )+
          /x or croak "Error: $_[0] is not a pure bit string.\n";
          return $num;
      }

      __DATA__
      0000
      0001
      0010
      0011
      0100
      0101
      0110
      0111
      1000
      1001
      1010
      1011
      1100
      1101
      1110
      1111

      Thanks again...


      Dave

Re: Inexplicable uninitialized value when using (?{...}) regexp construct.
by dave_the_m (Monsignor) on Sep 28, 2004 at 19:07 UTC
    Currently, code like
    sub f { my $x; /(?{$x})/ }
    gets compiled a bit like
    sub f { my $x; sub hidden {$x} }

    and due to the way closures work, the value of $x as seen by the inner sub 'hidden' is the value of $x at the first call to f().
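The named-inner-sub half of that analogy can be tried on its own. The sketch below is an illustration only: it shows the behavior of a named sub nested inside another sub, not the regexp engine itself, and the names f, hidden and @seen are just placeholders.

```perl
use strict;
use warnings;
no warnings 'closure';    # silence: Variable "$x" will not stay shared

sub f {
    my $x = shift;

    # A named sub is compiled once, so it captures the first instance of $x.
    sub hidden { return $x }

    return hidden();
}

# The first call shares $x with hidden(); later calls get a fresh $x that
# hidden() never sees.
our @seen = ( f(1), f(2) );
print "$_\n" for @seen;    # prints 1, then 1 again
```

The second call passes 2, but hidden() still reports the value from the first call, which mirrors the stale $num seen by the return statement in the original question.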

    I'm hoping to have this fixed by 5.10.0, assuming I can find enough tuits.

    Dave.

Re: Inexplicable uninitialized value when using (?{...}) regexp construct.
by BrowserUk (Patriarch) on Sep 28, 2004 at 18:58 UTC

    Switching my for our works fine for me with an identical setup: AS 5.8.4/XP. Though I have AS 510, and there was an earlier version (AS508?) in which I saw another problem that the move to 510 fixed. Maybe this is another?

    use strict;
    use warnings;

    while ( <DATA> ) {
        chomp;
        print bin_to_dec($_), "\n";
    }

    sub bin_to_dec {
        my $bits = shift;
        our( $power, $magnitude, $num );
        die "$bits is not a pure bit string.\n" if $bits =~ m/[^10]/;
        if ( $bits =~ m/
                (?{ $power = length($_) - 1; $num = 0; })
                (?:
                    ([10])
                    (?{
                        $magnitude = 2 ** $power;
                        $^N eq '1' and $num += $magnitude;
                        $power--;
                    })
                )+
            /x
        ) {
            return $num;
        }
        else {
            die "Unable to resolve bits: $bits.\n";
        }
    }

    __DATA__
    00000000
    00000011
    00000111
    11100000
    __OUTPUT__
    P:\test>junk
    0
    3
    7
    224

    Examine what is said, not who speaks.
    "Efficiency is intelligent laziness." -David Dunham
    "Think for yourself!" - Abigail
    "Memory, processor, disk in that order on the hardware side. Algorithm, algorithm, algorithm on the code side." - tachyon

      I must have been wrong when I said:
      I have also tried using $num (and the other variables used within the regexp) as package globals, with our, as well as with use vars, thinking that maybe lexical scoping was causing my pain. In so doing, I declared those variables at the top of the script to give them the broadest possible exposure. Again, no change; the second snippet fails, and the first snippet works great.

      I guess I wasn't atomic/careful enough in my testing. In converting to globals I may have broken something else, because you're right; after re-testing, converting to globals makes the script work.

      Based on your and others' comments, I dove back into perlre to see if I could find any mention that lexicals are quirky when used within (?{...}) constructs. I didn't find any such warning, aside from the familiar "This feature is considered highly experimental", applied to the construct as a whole. Should this be submitted as a documentation patch? Is this topic addressed elsewhere in the POD?


      Dave

        I got bitten by this a few times early on, without necessarily realising that the cause was the closures created by using lexicals in the code blocks. I just found that using our and/or local meant things worked as I wanted them to.

        This is one of the few places where the closure behaviour of Perl's lexicals is distinctly not useful.

        I've never seen any mention of this in the POD, though it has come up here and on p5p a few times. I think a documentation patch is a very good idea, though I have my doubts as to the usefulness of a full explanation of the causes and effects. I think a simple "Don't use lexicals in code assertions!" would probably suffice, be more beneficial and less confusing.


        Examine what is said, not who speaks.
        "Efficiency is intelligent laziness." -David Dunham
        "Think for yourself!" - Abigail
        "Memory, processor, disk in that order on the hardware side. Algorithm, algorithm, algorithm on the code side." - tachyon
        Lexicals aren't "quirky" here. They are just used like any other closure. You just aren't convincing perl that it should go to the trouble of recompiling your regular expression each time. If you start your regex with something like (?#@{[rand]}) then you'll have a nicely random part of your regex to get it to recompile each time.
Re: Inexplicable uninitialized value when using (?{...}) regexp construct.
by runrig (Abbot) on Sep 28, 2004 at 20:34 UTC
    You are not the first to make the mistake of using lexicals in this construct (see the replies). Lexicals create closures which are usually not what you want here.
Re: Inexplicable uninitialized value when using (?{...}) regexp construct.
by ikegami (Patriarch) on Sep 28, 2004 at 19:12 UTC

    After running a number of tests ($_ vs "$_" vs global vs my for the argument; my vs our for the locals; <DATA> vs qw(); while vs foreach vs foreach var), I narrowed down when it fails: it always fails on the second and subsequent calls. I don't know why, though.

    my $my_var = '11100000';
    print bin_to_dec($my_var), "\n";   # succeeds
    print bin_to_dec($my_var), "\n";   # fails!

    __END__
    224
    Use of uninitialized value in print at a line 11.

    I used v5.8.0 built for i386-freebsd. $^N was only introduced in 5.8.0, btw.