bobr has asked for the wisdom of the Perl Monks concerning the following question:

Hello, I found some strange behavior when running two regexes in one function call. The original script was trying to count lines, words and chars in a similar way to wc. Here is an extracted snippet to demonstrate:
use Data::Dump qw{dump};
$data = do { local $/ = undef; <DATA> };
# in single call
dump [
    scalar($data =~ s/\n/\n/g),
    scalar($data =~ s/(\w+)/\1/g),
    length($data)
];
# in multiple calls
dump [ scalar($data =~ s/\n/\n/g) ];
dump [ scalar($data =~ s/(\w+)/\1/g) ];
dump [ length($data) ];
__DATA__
Line1 Word Something
Line2 Other Word
The output is this
[6, 6, 38]
[2]
[6]
[38]
I wonder why the first line is not [2, 6, 38]. Can someone help me understand this? It looks like the second regex result is also overwriting the first result. This is my first post on perlmonks, sorry if I did something wrong.

Replies are listed 'Best First'.
Re: two regexes in one function call
by BrowserUk (Patriarch) on Jul 28, 2009 at 16:18 UTC

    I can confirm your results on AS 5.10, and can (maybe) point out that it does not appear to be a conflict between the regexes, as I get similarly disparate results using tr to count the newlines:

    use Data::Dump qw{dump};
    $data = do { local $/ = undef; <DATA> };
    # in single call
    dump [
        scalar( $data =~ tr[\n][\n] ),
        scalar($data =~ s/\n/\n/g),
        scalar($data =~ s/(\w+)/\1/g),
        length($data)
    ];
    # in multiple calls
    dump [ scalar( $data =~ tr[\n][\n] ) ];
    dump [ scalar($data =~ s/\n/\n/g) ];
    dump [ scalar($data =~ s/(\w+)/\1/g) ];
    dump [ length($data) ];
    __DATA__
    Line1 Word Something
    Line2 Other Word

    Gives:

    c:\test>783947.pl
    [6, 6, 6, 38]
    [2]
    [2]
    [6]
    [38]

    Update: Another set of conflicting results I cannot immediately explain?

    #! perl -sw
    use 5.010;
    $data = do { local $/ = undef; <DATA> };
    say join ':',
        scalar( $data =~ tr[\n][\n] ),
        scalar($data =~ s/\n/\n/g),
        scalar($data =~ s/(\w+)/$1/g),
        length($data)
    ;
    my $x = [
        scalar( $data =~ tr[\n][\n] ),
        scalar($data =~ s/\n/\n/g),
        scalar($data =~ s/(\w+)/$1/g),
        length($data)
    ];
    say "@$x";
    __DATA__
    Line1 Word Something
    Line2 Other Word

    Gives:

    c:\test>783947.pl
    Use of uninitialized value in join or string at ...
    6::6:38
    6 6 6 38

    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
      I noticed the tr as well, because I originally had it there. I get the same result on both AS and Strawberry.

      It looks like both have similar compile-time options:

      MULTIPLICITY PERL_DONT_CREATE_GVSV PERL_IMPLICIT_CONTEXT PERL_IMPLICIT_SYS PERL_MALLOC_WRAP PL_OP_SLAB_ALLOC USE_ITHREADS USE_LARGE_FILES USE_PERLIO
      AS also has USE_SITECUSTOMIZE specified.
Re: two regexes in one function call
by ikegami (Patriarch) on Jul 28, 2009 at 16:33 UTC

    It looks like the second regex result is also overwriting the first result.

    Indeed.

    $data = do { local $/ = undef; <DATA> };
    print \$_, "\n" for
        scalar($data =~ s/\n/\n/g),
        scalar($data =~ s/(\w+)/\1/g);
    __DATA__
    Line1 Word Something
    Line2 Other Word

    SCALAR(0x239224) \ Same var!
    SCALAR(0x239224) /

    In ActivePerl 5.10.0 build 1004, s/// appears to place the return value in a global variable, and it appears to place that global variable (not a copy) on the stack.

    By the way, it's wrong to use \1 in the replacement expression. Use $1.
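A minimal sketch of the `$1` versus `\1` point, using hypothetical sample strings that are not from the thread. In the replacement part the captured text is `$1`; `\1` is a backreference meant for use inside the pattern. Under `use warnings`, perl reports "\1 better written as $1" when `\1` appears in the replacement.

```perl
use strict;
use warnings;

# Hypothetical sample text, not taken from the thread.
my $text = "Line1 Word";

# In the replacement part, refer to the captured group as $1.
(my $marked = $text) =~ s/(\w+)/<$1>/g;
print "$marked\n";

# \1 is a backreference and belongs inside the pattern itself:
# here it requires the same word to appear twice in a row.
my $has_repeat = "hello hello" =~ /\b(\w+) \1\b/ ? 1 : 0;
print "repeated word found: $has_repeat\n";
```

The copy-then-substitute form `(my $marked = $text) =~ s/.../.../` keeps the original string untouched.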

      I'm not sure that's the complete explanation, as you get different results if you reverse the order of the statements:

      #! perl -sw
      use 5.010;
      $data = do { local $/ = undef; <DATA> };
      my $x = [
          scalar($data =~ s/(\w+)/$1/g),
          scalar($data =~ s/\n/\n/g),
      ];
      say "@$x";
      my $y = [
          scalar($data =~ s/\n/\n/g),
          scalar($data =~ s/(\w+)/$1/g),
      ];
      say "@$y";
      __DATA__
      Line1 Word Something
      Line2 Other Word

      Gives:

      c:\test>783947.pl
      6 2
      6 6

      (Plus tr isn't a regex!)
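To illustrate the remark that tr is not a regex, here is a small sketch. It assumes the thread's two-line sample held in a string rather than read from DATA. With an empty replacement list and no /d modifier, tr/// leaves the string unchanged and returns the number of characters it matched, so it counts without any regex engine involvement.

```perl
use strict;
use warnings;

# Same two-line sample as the thread's __DATA__ section (assumed to end in a newline).
my $data = "Line1 Word Something\nLine2 Other Word\n";

# tr with an empty replacement list only counts; $data is not modified.
my $newlines = ($data =~ tr/\n//);

# tr works on character lists, not patterns: A-Z is a range of characters.
my $uppercase = ($data =~ tr/A-Z//);

print "newlines=$newlines uppercase=$uppercase length=", length($data), "\n";
```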


        Minor change:
        #! perl -sw
        use 5.010;
        $data = do { local $/ = undef; <DATA> };
        my $x = [map \$_,
            scalar($data =~ s/(\w+)/$1/g),
            scalar($data =~ s/\n/\n/g),
        ];
        say "@$x";
        my $y = [map \$_,
            scalar($data =~ s/\n/\n/g),
            scalar($data =~ s/(\w+)/$1/g),
        ];
        say "@$y";
        __DATA__
        Line1 Word Something
        Line2 Other Word

        SCALAR(0x2392c4) SCALAR(0x239224)
        SCALAR(0x239314) SCALAR(0x239314)

        The Ys use the same variable for the return value, but not the Xs.
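The diagnostic relies on `map` aliasing `$_` to each list element, so `\$_` is a reference to the scalar that was actually passed in. A minimal sketch with two ordinary lexicals (the names `$x` and `$y` here are illustrative only and unrelated to the arrays above) shows how identical addresses indicate one underlying variable:

```perl
use strict;
use warnings;

my ($x, $y) = (1, 2);

# map aliases $_ to each element, so \$_ points at the original scalar.
my @same = map \$_, $x, $x;   # the same variable passed twice
my @diff = map \$_, $x, $y;   # two separate variables

# References compare numerically by address.
my $same_sv = $same[0] == $same[1] ? "yes" : "no";
my $diff_sv = $diff[0] == $diff[1] ? "yes" : "no";

print "x,x share one SV: $same_sv\n";
print "x,y share one SV: $diff_sv\n";
```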

        I don't actually know *why* the two calls return the same variable.

        Interestingly, the problem goes away if the captures are removed.

        Update: There has been a change in the relevant code since 5.10.0, but it appears to be an optimisation and not a fix.

        5.10.0:
        PUSHs(sv_2mortal(newSViv((I32)iters)));

        maint-5.10 as it stands right now:
        mPUSHi((I32)iters);

        Patch

        In both cases, it looks like the operator always returns a new SV, so maybe there's some stack corruption??

Re: two regexes in one function call
by moritz (Cardinal) on Jul 28, 2009 at 15:48 UTC
    It does print [2, 6, 38] for me, with perl-5.10.0 on Debian GNU/Linux.

    Maybe something's wrong with your version of perl?

Re: two regexes in one function call
by SuicideJunkie (Vicar) on Jul 28, 2009 at 16:08 UTC
    Why are you using substitutions in a "count" script?
    And then replacing whatever it was with itself?

    Best to not modify your input and risk changing things while you're trying to count.
    Note: See Perl Idioms Explained - my $count = () = /.../g
    use strict;
    use warnings;
    sub countIt {
        my $data = shift;
        my $lines =()= $data =~ /\n/g;
        $lines++ unless $data =~ /\n$/; # Last line may not have \n on it
        my $words =()= $data =~ /\w+/g;
        my $length = length($data);
        return ($lines, $words, $length);
    }
    print join ',', countIt(" Blah blah blah\nStuff-by:somebody");
    #Prints 2,6,34
      Agreed, this is definitely safer and probably much faster. Thanks for pointing this out, the idioms are very interesting.

      What surprised me was the output of my short-sighted solution, where each expression worked on its own, but combined together they did not.

      I recently tried another Perl for Windows, Strawberry, and it seems to produce the very same result. Strange.

Re: two regexes in one function call
by FunkyMonk (Bishop) on Jul 28, 2009 at 15:48 UTC
    I get 2, 6, 38 for both sequences.
    $ perl -v
    This is perl, v5.10.0 built for x86_64-linux-gnu-thread-multi

    What version of perl, and which OS?

      Sorry for that, it is ActiveState perl 5.10.0:
      perl -v
      This is perl, v5.10.0 built for MSWin32-x86-multi-thread
      (with 5 registered patches, see perl -V for more detail)
Re: two regexes in one function call
by ELISHEVA (Prior) on Jul 28, 2009 at 20:23 UTC

    In your example, s// is modifying $data twice within one statement.

    Could it be that, as a general rule, the result of modifying a variable twice in a statement is unspecified? This might explain the implementation-dependent behavior discussed above. It certainly seems to be the case for the increment and decrement operators. Perhaps it is also true for s// and other operators that modify the variable upon which they operate? As stated in perlop (I've bolded the key words):

    Note that just as in C, Perl doesn't define when the variable is incremented or decremented. You just know it will be done sometime before or after the value is returned. This also means that modifying a variable twice in the same statement will lead to undefined behaviour.

    Best, beth

      That's not the problem. For starters, neither s/// modifies the variable.

      There are problems with that quote anyway. Perl *does* define the operand evaluation order for some operators (namely the comma and the logical operators), it's regularly relied upon for some other operators (assignment operators and method calls), and it's predictable for the rest.
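As a small sketch of the defined cases mentioned above: comma-separated list elements are evaluated left to right, and the logical operators short-circuit. The helper `note` is hypothetical and exists only to record the order of calls.

```perl
use strict;
use warnings;

my @order;

# Hypothetical helper: records when it is called and returns its argument.
sub note { my ($v) = @_; push @order, $v; return $v }

# Comma-separated list elements are evaluated left to right.
my @list = (note(1), note(2), note(3));

# || evaluates its right operand only when the left one is false.
my $first_true = note(0) || note(4);

# && stops at the first false operand, so note(5) is never called.
my $stopped = note(0) && note(5);

print "call order: @order\n";
```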

      Update: Added second paragraph.

Re: two regexes in one function call
by bobr (Monk) on Jul 29, 2009 at 08:18 UTC
    Thanks to all for the ideas and suggestions. It looks like the return values of the matches are, in some cases, placed into the same SV, overwriting the earlier result.
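One way to sidestep the shared return value, sketched under the assumption that the "multiple calls" output earlier in the thread is representative: copy each count into its own lexical in a separate statement before building the list. This has not been verified on the affected ActivePerl build. The sample text is held in a string here instead of being read from DATA.

```perl
use strict;
use warnings;

# Thread's sample text, assumed to end with a newline.
my $data = "Line1 Word Something\nLine2 Other Word\n";

# Each result is copied into its own variable as soon as it is produced,
# so a later s/// cannot overwrite an earlier count.
my $lines = ($data =~ s/\n/\n/g);
my $words = ($data =~ s/(\w+)/$1/g);
my $chars = length $data;

my @counts = ($lines, $words, $chars);
print "@counts\n";
```

The non-modifying `my $count = () = /.../g` idiom suggested earlier in the thread avoids s/// entirely and is probably the better choice for new code.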