ruscoekm has asked for the wisdom of the Perl Monks concerning the following question:

Has anybody come across the following behavior...

When I run this code:

use Data::Dumper;
my $re_bad = qr'@@([A-Z$#@_#]*) (?! [A-Za-z0-9$#@_] )'x;
print Dumper $re_bad;

using Perl v5.8.0 (on Solaris), I get the following output:
$VAR1 = qr/(?x-ism:@@([A-Z$#@_#]*) (?! [A-Za-z0-9$#@_] )$ )/;
Notice the spurious characters at the end of the regex. Using Perl v5.6.1, the output is as expected:
$VAR1 = qr/(?x-ism:@@([A-Z$#@_#]*) (?! [A-Za-z0-9$#@_] ))/;
The only reason I noticed this is that, in the regex I was actually using, the characters scribbled onto the end were **, resulting in an invalid regex. So I can imagine that this error could go undetected for some time.

NB The issue is not with Data::Dumper. I just used that to illustrate. It appears that Perl is actually writing extra characters onto the regex.

Obviously, I will report the bug, but I wanted to see if anybody else had come across it. This will stop us migrating to 5.8.0 unless we can find an acceptable workaround.

Cheers, Kevin

Re: Worrying regex issue with 5.8.0
by chromatic (Archbishop) on Nov 14, 2002 at 20:36 UTC

    Works for me just fine on 5.8.0 on Mac OS X and GNU/Linux and on 5.6.1 on Linux.

      Hmm. Thanks for that. Maybe it is Solaris-only, then. I will mention that in the report. The build is pretty vanilla. Any Solaris users out there? (Note that it also works fine for me with 5.6.1 on Solaris.)

      Kevin

        Works fine here as well. Here's my info:
        Solaris 8 (420R)
        [3:24pm] 11 [~]:msp-mainserver% perl -V
        Summary of my perl5 (revision 5.0 version 8 subversion 0) configuration:
          Platform:
            osname=solaris, osvers=2.8, archname=sun4-solaris
            uname='sunos solaris 5.8 generic_108528-11 sun4u sparc sunw,ultra-5_10 '
            config_args='-Dcc=gcc -B/usr/ccs/bin/'
            hint=recommended, useposix=true, d_sigaction=define
            usethreads=undef use5005threads=undef useithreads=undef usemultiplicity=undef
            useperlio=define d_sfio=undef uselargefiles=define usesocks=undef
            use64bitint=undef use64bitall=undef uselongdouble=undef
            usemymalloc=n, bincompat5005=undef
          Compiler:
            cc='gcc -B/usr/ccs/bin/', ccflags ='-fno-strict-aliasing -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64',
            optimize='-O',
            cppflags='-fno-strict-aliasing'
            ccversion='', gccversion='3.1', gccosandvers='solaris2.8'
            intsize=4, longsize=4, ptrsize=4, doublesize=8, byteorder=4321
            d_longlong=define, longlongsize=8, d_longdbl=define, longdblsize=16
            ivtype='long', ivsize=4, nvtype='double', nvsize=8, Off_t='off_t', lseeksize=8
            alignbytes=8, prototype=define
          Linker and Libraries:
            ld='gcc -B/usr/ccs/bin/', ldflags =' -L/usr/local/lib '
            libpth=/usr/local/lib /usr/lib /usr/ccs/lib
            libs=-lsocket -lnsl -lgdbm -ldl -lm -lc
            perllibs=-lsocket -lnsl -ldl -lm -lc
            libc=/lib/libc.so, so=so, useshrplib=false, libperl=libperl.a
            gnulibc_version=''
          Dynamic Linking:
            dlsrc=dl_dlopen.xs, dlext=so, d_dlsymun=undef, ccdlflags=' '
            cccdlflags='-fPIC', lddlflags='-G -L/usr/local/lib'

        Characteristics of this binary (from libperl):
          Compile-time options: USE_LARGE_FILES
          Built under solaris
          Compiled at Jul 22 2002 02:55:19
          @INC:
            /usr/local/lib/perl5/5.8.0/sun4-solaris
            /usr/local/lib/perl5/5.8.0
            /usr/local/lib/perl5/site_perl/5.8.0/sun4-solaris
            /usr/local/lib/perl5/site_perl/5.8.0
            /usr/local/lib/perl5/site_perl

        -Waswas

      Just got home. It works fine for me on Linux as well. So, clearly something specific to the environment at work.

      Cheers, Kevin

Re: Worrying regex issue with 5.8.0
by graff (Chancellor) on Nov 15, 2002 at 03:38 UTC
    This is some weird stuff, but I'm confused. Does the name of the variable ($re_bad) imply that the expression typed by the programmer is bad, or are you saying that this should be considered an okay expression, and perl just seems to be doing something bad to it?

    I wonder, because this part:  [A-Z$#@_#] seems to be specifying "#" twice within a character class, which seems odd (should be harmless, granted, but odd). The "qr" operator is being used with single-quote delimiters, so the "@_" is being taken literally, rather than being interpolated (was that your intention?). When some other delimiter is used, the result is different: with nothing else in the script, "@_" is empty, so it seems to interpolate as an empty string in this case.

    I tried both ways, on both my home linux and my office solaris 5.8, and got results that were not identical, but equivalent, and worrisome in all cases. Feeding Data::Dumper's output through something like "od" helps to see the issue:

    solaris5.8 $ test.perl | od -t ax1
    0000000   $   V   A   R   1  sp   =  sp   q   r   /   (   ?   x   -   i
             24  56  41  52  31  20  3d  20  71  72  2f  28  3f  78  2d  69
    0000020   s   m   :   @   @   (   [   A   -   Z   $   #   @   _   #   ]
             73  6d  3a  40  40  28  5b  41  2d  5a  24  23  40  5f  23  5d
    0000040   *   )  sp   (   ?   !  sp   [   A   -   Z   a   -   z   0   -
             2a  29  20  28  3f  21  20  5b  41  2d  5a  61  2d  7a  30  2d
    0000060   9   $   #   @   _   ]  sp   ) nul nul  lf   )   /   ;  lf
             39  24  23  40  5f  5d  20  29  00  00  0a  29  2f  3b  0a

    Note the null bytes. Is it supposed to do that? I happened to get the exact same results on linux for this case.

    Now, if I change the delimiters on "qr" to something else (like slashes), which allows the "@_" to interpolate, the two systems do seem to differ:

    solaris5.8 $ test-slashes.perl | od -t ax1
    0000000   $   V   A   R   1  sp   =  sp   q   r   /   (   ?   x   -   i
             24  56  41  52  31  20  3d  20  71  72  2f  28  3f  78  2d  69
    0000020   s   m   :   @   @   (   [   A   -   Z   #   ]   *   )  sp   (
             73  6d  3a  40  40  28  5b  41  2d  5a  23  5d  2a  29  20  28
    0000040   ?   !  sp   [   A   -   Z   a   -   z   0   -   9   $   #   @
             3f  21  20  5b  41  2d  5a  61  2d  7a  30  2d  39  24  23  40
    0000060   _   ]  sp   ) can  lf   )   /   ;  lf
             5f  5d  20  29  18  0a  29  2f  3b  0a

    linux2.4 $ test-slashes.perl | od -t ax1
    0000000   $   V   A   R   1  sp   =  sp   q   r   /   (   ?   x   -   i
             24  56  41  52  31  20  3d  20  71  72  2f  28  3f  78  2d  69
    0000020   s   m   :   @   @   (   [   A   -   Z   #   ]   *   )  sp   (
             73  6d  3a  40  40  28  5b  41  2d  5a  23  5d  2a  29  20  28
    0000040   ?   !  sp   [   A   -   Z   a   -   z   0   -   9   $   #   @
             3f  21  20  5b  41  2d  5a  61  2d  7a  30  2d  39  24  23  40
    0000060   _   ]  sp   )   s  lf   )   /   ;  lf
             5f  5d  20  29  73  0a  29  2f  3b  0a

    Where solaris had one garbage character (0x18), linux had a different character ("s"), which on the surface looks plausible, but is garbage nonetheless, I expect (e.g. it's whatever happens to have been at some point in core when a C function happens to step past the boundary of an array).

    Also, it seems that only the first instance of "@_" was interpolated -- if that makes sense, I need to read perlre again, much more carefully...

      I am glad that somebody managed to repeat the error. I think that people were beginning to think I was seeing things :-)

      I want to try some more tests myself but, to answer your questions:

      The variable name $re_bad is meant to imply that Perl is doing something bad with a valid regex. The regex is meaningless, that was just as small as I could get it. (As mentioned above, if I remove any further elements, I do not get the error.) The real regex is much longer. In fact, I got this error for a couple of regexes - one to match a SQL label and one to match a SQL global variable.

      Yes, I intentionally used single-quote delimiters. I have also observed the behaviour when using other delimiters. However, so far, I only get the problem when using /x...

      Kevin

Re: Worrying regex issue with 5.8.0
by PodMaster (Abbot) on Nov 15, 2002 at 05:53 UTC
    This doesn't appear to be a perl issue, but rather a Data::Dumper issue.

    Why is Data::Dumper involved anyway?

    use Data::Dumper;
    my $re_bad = qr'@@([A-Z$#@_#]*) (?! [A-Za-z0-9$#@_] )'x;
    print Dumper $re_bad;
    print "\n\n\n";
    print $re_bad;
    __END__
    $VAR1 = qr/(?x-ism:@@([A-Z$#@_#]*) (?! [A-Za-z0-9$#@_] ))/;



    (?x-ism:@@([A-Z$#@_#]*) (?! [A-Za-z0-9$#@_] ))

    ____________________________________________________
    ** The Third rule of perl club is a statement of fact: pod is sexy.

      Well, from your output, you appear not to get any spurious characters anyway.

      When I said above that Data::Dumper was not the issue, what I meant was that I was just using it to display the output. I originally observed the problem because the characters added to my regex were **, which caused the regex to be invalid (luckily, or I might not have spotted it for a while).

      However, Data::Dumper may yet be involved. When I run the following code:

      use Data::Dumper;
      my $re_bad = qr'xx([A-Z$#@_!]*) (?! [A-Za-z0-9$#@_] )'x;
      print $re_bad;

      the output is:

      (?x-ism:xx([A-Z$#@_!]*) (?! [A-Za-z0-9$#@_] )# )

      Note the spurious hash (or pound, if you prefer :-) at the end. Given the following code:

      my $re_bad = qr'xx([A-Z$#@_!]*) (?! [A-Za-z0-9$#@_] )'x;
      print $re_bad;

      the output is:

      (?x-ism:xx([A-Z$#@_!]*) (?! [A-Za-z0-9$#@_] ) )

      However, this does not yet prove that Data::Dumper is the culprit. Given the sporadic nature of the symptoms, it may well be that simply using any module of the right "size" will produce the error.

      Kevin

      Ok, latest summary...

      It appears that the error does not occur unless both Data::Dumper and /x are present. Since the error is so sporadic, it is hard to be sure.

      NB Just avoiding the use of Data::Dumper (say by using Dumpvalue instead) is not an option, since many CPAN modules which I am using themselves use Data::Dumper.

      I will hang on a bit longer, to see if anybody spots anything else, then submit a bug report.

      Cheers, Kevin

Re: Worrying regex issue with 5.8.0
by tommyw (Hermit) on Nov 15, 2002 at 10:31 UTC

    This looks to me as though it's the same problem I tripped over: perl -e '$d=qr /\#\#/x; print $d, "\n";' produces:

    (?x-ism:\#\#l )
    when I thought it ought to produce (?x-ism:\#\#)

    I sent it in as a bug report, and it's been fixed:

    The memory corruption bug should be corrected by change #17994 :
    
    Change 17994 by rgs@rgs-home on 2002/10/10 20:19:27
    
            Fix bug #17776 : memory corruption in qr/##/x
    

    The actual bug is due to:

     * So, if /x was used, we scan backwards from the
     * end of the regex. If we find a '#' before we
     * find a newline, we need to add a newline
     * ourself. If we find a '\n' first (or if we
     * don't find '#' or '\n'), we don't need to add
     * anything.  -jfriedl
    
    So if you can reorganise your regexes to stop this occurring, you'll be alright.
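
    The rule in the quoted source comment can be sketched in a few lines of Perl. This is only an illustration: needs_trailing_newline is a hypothetical helper name, not anything in the perl source, and it looks at the raw pattern text exactly as the comment describes, without caring whether a '#' is escaped or sits inside a character class.

    ```perl
    use strict;
    use warnings;

    # Hypothetical helper (not part of perl): returns 1 when the last '#' in
    # the pattern comes after the last newline, i.e. when a backward scan
    # from the end would meet a '#' before it meets a newline.
    sub needs_trailing_newline {
        my ($pattern) = @_;
        my $last_hash    = rindex($pattern, '#');    # -1 if there is no '#'
        my $last_newline = rindex($pattern, "\n");   # -1 if there is no newline
        return $last_hash > $last_newline ? 1 : 0;
    }

    print needs_trailing_newline('\#\#'), "\n";             # prints 1
    print needs_trailing_newline("a # comment\n b"), "\n";  # prints 0
    print needs_trailing_newline('abc'), "\n";              # prints 0
    ```

    Going by the comment, a /x pattern for which this returns 1 is one where perl has to append a newline itself, which is the code path that bug #17776 concerns.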

    --
    Tommy
    Too stupid to live.
    Too stubborn to die.

      Thanks Tommy - saved me from submitting a duplicate bug report.

      Ok, so Data::Dumper is not the issue. Is it your understanding that any regex which uses /x and which contains a hash (even in a character class) which is not followed by a newline will potentially corrupt memory? If so, that is pretty grim. The instruction would then be:

      Before upgrading to 5.8.0, check every regex which uses /x. For each one which contains a # not followed by a newline, append the text "# FIXME - workaround for bug #17776" followed by a literal newline.
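
      A minimal sketch of that workaround, assuming the quoted source comment describes the trigger accurately: the pattern ends with a comment and a literal newline, so nothing has to be appended. It has not been checked here against an affected 5.8.0 build.

      ```perl
      use strict;
      use warnings;

      # The closing quote sits on the next line, so the pattern text ends with
      # a real newline after the last '#'. Under /x the comment and the
      # trailing whitespace are ignored when matching.
      my $re_ok = qr'xx([A-Z$#@_!]*) (?! [A-Za-z0-9$#@_] )  # FIXME - workaround for bug #17776
      'x;

      print "matched: $1\n" if 'xxAB' =~ $re_ok;   # prints "matched: AB"
      ```

      Note that an escaped \n inside the pattern would not do the same job: it is a token that matches a newline, not a newline character in the pattern text.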

      Cheers, Kevin

        Yes, or take the spaces out, and then dump the /x modifier. If you insist on keeping the original nicely formatted with spaces, then just pump it through s/ //g before passing it to qr (which was my solution). Obviously, this needs a certain amount of care too.
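
        A sketch of that approach, with compile_without_x as a hypothetical helper name. It assumes the pattern holds no /x comments and no whitespace that is meant to match, because every whitespace character is removed, including any inside a character class. It uses \s rather than a plain space so that tabs and newlines go too.

        ```perl
        use strict;
        use warnings;

        # Hypothetical helper: keep the readable, spaced-out pattern as a plain
        # string, strip the formatting whitespace, and compile without /x.
        sub compile_without_x {
            my ($pattern) = @_;
            (my $compact = $pattern) =~ s/\s+//g;   # removes ALL whitespace
            return qr/$compact/;                    # contents are not re-interpolated
        }

        my $re = compile_without_x('xx([A-Z$#@_!]*) (?! [A-Za-z0-9$#@_] )');
        print "matched: $1\n" if 'xxAB' =~ $re;   # prints "matched: AB"
        ```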

        --
        Tommy
        Too stupid to live.
        Too stubborn to die.