Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

I have been experimenting with optimizing code for performance, benchmarking a few methods and recording the differences in execution speed.

I wrote the following code as a simple test to compare the two syntactic forms of the "for loop".

#!/usr/bin/perl -w
use Benchmark;
use strict;

my $results = timethese( 30, {
    a => sub { my $total = 0; $total += $_ for (1..1e6); },
    b => sub { my $total = 0; for (1..1e6) { $total += $_; } },
} );
Here are the results returned when run:
Benchmark: timing 30 iterations of a, b...
         a:  8 wallclock secs ( 8.30 usr +  0.00 sys =  8.30 CPU) @  3.62/s (n=30)
         b:  9 wallclock secs ( 9.03 usr +  0.02 sys =  9.05 CPU) @  3.32/s (n=30)
Does anyone have any idea why the method in sub "b" (the block form) executes more slowly than the method in sub "a" (the statement-modifier form)?

The recorded difference might not seem like much, but the larger the range, the larger the delta between the two becomes. Try changing the "1e6" to "1e7" and you will see what I mean.
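For comparison, here is a variant of the same test (a sketch; the helper sub names are mine) using Benchmark's cmpthese, which prints a table of relative percentages instead of raw timings. The range is reduced to 1e5 so it runs quickly:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Sum 1..$n with the postfix (statement-modifier) form of for.
sub postfix_sum {
    my ($n) = @_;
    my $total = 0;
    $total += $_ for 1 .. $n;
    return $total;
}

# Sum 1..$n with the block form of for.
sub block_sum {
    my ($n) = @_;
    my $total = 0;
    for (1 .. $n) { $total += $_; }
    return $total;
}

# cmpthese reports each sub's rate and a percentage comparison table.
cmpthese( 30, {
    postfix => sub { postfix_sum(1e5) },
    block   => sub { block_sum(1e5) },
} );
```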

Replies are listed 'Best First'.
Re: Benchmarking for loops?
by chromatic (Archbishop) on Jul 13, 2004 at 00:00 UTC

    Looking at the output of B::Concise, there are two extra opcodes in the block version. Compare:

    $ perl -MO=Concise -e 'my $total = 0; $total += $_ for(1..1e6);'

    to:

    $ perl -MO=Concise -e 'my $total = 0; for(1..1e6){$total += $_;}'

    At first glance, it looks like the opcodes manage entering and exiting the block scope.

      Yup, and similar things happen with grep EXPR, LIST vs. grep BLOCK LIST (as well as the corresponding maps). If you're really looking to eke out those last few cycles and can do without a full block (perhaps by wrapping the grep or map in an enclosing block and reusing a lexical scoped to that block in the EXPR), this can shave a bit off.
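      A minimal sketch of that grep comparison (the names and counts here are mine, not from the thread), again using Benchmark's cmpthese:

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

my @nums = (1 .. 1_000);

# grep EXPR compiles without the block-scope ops that grep BLOCK
# needs, so it can be marginally faster. Whether the gap is visible
# depends on the perl version; the results are illustrative only.
cmpthese( 2_000, {
    expr  => sub { my @odd = grep $_ % 2, @nums },
    block => sub { my @odd = grep { $_ % 2 } @nums },
} );
```

      Both forms select the same elements; only the compiled op-tree differs.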

      What do you see? I see only one opcode difference (an extra nextstate). Which perl version?

        Here's a diff. This is a vanilla 5.8.3 on Linux PPC.

        --- postfix     2004-07-12 22:45:53.000000000 -0700
        +++ block       2004-07-12 22:45:44.000000000 -0700
        @@ -3,23 +3,23 @@
         2  <;> nextstate(main 1 -e:1) v ->3
         5  <2> sassign vKS/2 ->6
         3     <$> const(IV 0) s ->4
        -4     <0> padsv[$total:1,2] sRM*/LVINTRO ->5
        -6  <;> nextstate(main 2 -e:1) v ->7
        -7  <;> nextstate(main 2 -e:1) v ->8
        +4     <0> padsv[$total:1,4] sRM*/LVINTRO ->5
        +6  <;> nextstate(main 3 -e:1) v ->7
         j  <2> leaveloop vK/2 ->k
        -c     <{> enteriter(next->g last->j redo->d) lKS ->h
        --        <0> ex-pushmark s ->8
        --        <1> ex-list lK ->b
        -8        <0> pushmark s ->9
        -9        <$> const(IV 1) s ->a
        -a        <$> const(NV 1000000) s ->b
        -b        <$> gv(*_) s ->c
        +b     <{> enteriter(next->g last->j redo->c) lKS ->h
        +-        <0> ex-pushmark s ->7
        +-        <1> ex-list lK ->a
        +7        <0> pushmark s ->8
        +8        <$> const(IV 1) s ->9
        +9        <$> const(NV 1000000) s ->a
        +a        <$> gv(*_) s ->b
         -     <1> null vK/1 ->j
        -i     <|> and(other->d) vK/1 ->j
        +i     <|> and(other->c) vK/1 ->j
         h        <0> iter s ->i
         -        <@> lineseq vK ->-
        -f           <2> add[t2] vKS/2 ->g
        -d              <0> padsv[$total:1,2] sRM ->e
        +c        <;> nextstate(main 2 -e:1) v ->d
        +f           <2> add[t4] vKS/2 ->g
        +d              <0> padsv[$total:1,4] sRM ->e
         -              <1> ex-rv2sv sK/1 ->f
         e                 <$> gvsv(*_) s ->f
         g        <0> unstack v ->h

        I believe (but am not absolutely sure) that nextstate is the opcode that sets things up to enter a block, or perhaps to exit it; I can't remember which. (Something makes me think the op-tree is traversed/processed in post-order rather than pre-order, which might make the difference.) If you are really interested, you might want to check out Simon Cozens's Perl 5 Internals article; in particular, the end discusses some of the output of the B modules and how to read and understand it.

        The (overly simplistic) reason why entering and exiting a block carries a performance penalty is that Perl is lexically scoped. On entering each block, a new lexical scope has to be set up, and on exiting the block all the saved variables have to be restored.
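        To see that scope cost in isolation, here is a sketch (not code from the thread) where the only difference between two subs is a bare block; any gap comes from the enter/leave-scope work, and it may be small or absent depending on the perl version:

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Identical computation; the first wraps it in a bare block,
# forcing a scope to be entered and left on every call.
sub with_block    { { my $x = shift; return $x + 1; } }
sub without_block { my $x = shift; return $x + 1; }

cmpthese( 300_000, {
    with_block    => sub { with_block(7) },
    without_block => sub { without_block(7) },
} );
```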

        -stvn
Re: Benchmarking for loops?
by keszler (Priest) on Jul 13, 2004 at 00:27 UTC
    In "Benchmark.pm: Does subroutine testing order bias results?" it was suggested that the first test is consistently faster, even when the tests are identical.

    I added a 2nd "timethese" to your code, with the tests reversed:

    Benchmark: timing 30 iterations of a, b...
             a: 26 wallclock secs (26.64 usr +  0.00 sys = 26.64 CPU) @  1.13/s (n=30)
             b: 32 wallclock secs (31.64 usr +  0.00 sys = 31.64 CPU) @  0.95/s (n=30)
    Benchmark: timing 30 iterations of a2, b2...
            a2: 20 wallclock secs (20.10 usr +  0.00 sys = 20.10 CPU) @  1.49/s (n=30)
            b2: 28 wallclock secs (27.25 usr +  0.00 sys = 27.25 CPU) @  1.10/s (n=30)

    It appears that at least some of the difference is due to the order of the tests.
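    A quick way to check for that bias (a sketch, not code from the thread) is to benchmark the very same coderef under two different names; if "first" still consistently beats "second", the gap comes from the harness, not the code under test:

```perl
use strict;
use warnings;
use Benchmark qw(timethese);

# One coderef, registered twice. Any timing difference between the
# entries below is measurement bias (warm-up, caching), not the code.
my $identical = sub {
    my $total = 0;
    $total += $_ for 1 .. 1e4;
    $total;
};

timethese( 100, {
    first  => $identical,
    second => $identical,
} );
```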

      Do you mean to say that sub "a2" is identical to "b", and "b2" is the same code as "a"? (I just wasn't sure what you meant.)
        Yes, exactly. Cut&paste, rename a -> b2, b -> a2.
Re: Benchmarking for loops?
by ysth (Canon) on Jul 13, 2004 at 00:02 UTC
    Try removing the last ; from b

    Update: incidentally, that was sarcasm directed at what sounded like premature optimization.