Why is goto &sub slow?

Ovid has asked for the wisdom of the Perl Monks concerning the following question:

So I'm playing around with recursive subroutines and have results that are a bit confusing. Consider a normal recursive function:

sub factorial {
    my $num = shift;
    return 1 if 1 == $num;
    return $num * factorial($num = 1);
}

Ordinarily, we expect this to have quite a bit of overhead. Part of the reason for this is that we have deferred part of the calculation. For example, here's what this looks like if we try to calculate 5!


1:  factorial(5)
2:  (5 * factorial(4))
3:  (5 * (4 * factorial(3)))
4:  (5 * (4 * (3 * factorial(2))))
5:  (5 * (4 * (3 * (2 * factorial(1)))))
6:  (5 * (4 * (3 * (2 * 1))))
7:  (5 * (4 * (3 * 2)))
8:  (5 * (4 * 6))
9:  (5 * 24)
10: (120)

The problem with this is that we must defer part of the calculation until we get to the terminal condition (1 == $num, or step 5) of the recursive function. Then we go back through the call stack and finish the calculations. In theory, we get around this by building our our recursive function in such a way so that we don't need to walk back through the call stack, but instead have the value calculated every step of the way and return directly from the last call, which could allow us to skip steps 6 through 9 above. We can do this by rewriting our factorial function to keep track of the current value of the factorial and using the magical goto &sub notation.

sub factorial {
    my ($num) = @_;
    @_ = (1, 1, $num);
    goto &_factorial;
}
                                                                      
+                                   
sub _factorial {
    my ($result, $counter, $max) = @_;
    return $result if $counter > $max;
    @_ = (($result * $counter), ($counter + 1), $max);
    goto &_factorial;
}
[download]

After playing with this, I realized that I could do even better by using a lexically scoped subroutine and skipping the symbol table lookup, though it winds up being more cumbersome to program.

my $_factorial;
$_factorial = sub {
    my ($result, $counter, $max) = @_;
    return $result if $counter > $max;
    @_ = (($result * $counter), ($counter + 1), $max);
    goto $_factorial;
};

sub factorial {
    my ($num) = @_;
    @_ = (1, 1, $num);
    goto $_factorial;
}
[download]

While my benchmarks show that the lexically scoped subroutine with goto is slightly faster than the first version using goto, it's considerably slower than the straight recursive function. pdcawley noted in Pure Perl tail call optimization that using this form of goto is not very fast; I'm curious to know why it's so ridiculously slow. I've tested this on 5.8.0 and 5.9.1 (I mention 5.9.1 to make it clear that the memory leak associated with assignment to @_ and calling goto is not the culprit). Benchmark code and results below. Did I do something stupid?

I might add that I know an iterative solution is faster. I'm just trying to figure out why the goto is slow.

#!/usr/local/bin/perl
use warnings;
use strict;
use Benchmark;
use Test::More tests => 3;
 
my $_fact3;
$_fact3 = sub {
    my ($result, $counter, $max) = @_;
    return $result if $counter > $max;
    @_ = (($result * $counter), ($counter + 1), $max);
    goto $_fact3;
};
 
my $number = shift or die "No number supplied";
 
validate_factorial_routines($number);
 
print "Timing factorial of $number\n";
 
timethese(10000, {
    'recursion' => sub { fact1($number) },
    'typeglob'  => sub { fact2($number) },
    'lexical'   => sub { fact3($number) },
});
 
sub validate_factorial_routines {
    my $num = shift;
    my @result = (fact1($num), fact2($num), fact3($num));
    is(fact1(5), 120, 'fact1 should return the correct amount');
    is(fact2(5), 120, 'fact2 should return the correct amount');
    is(fact3(5), 120, 'fact3 should return the correct amount');
}
 
sub fact1 {
    my ($num) = @_;
    return _fact1(1,1,$num);
}
 
sub _fact1 {
    my ($result, $counter, $max) = @_;
    return $result if $counter > $max;
    return _fact1(($result * $counter), ($counter + 1), $max);
}
 
sub fact2 {
    my ($num) = @_;
    @_ = (1, 1, $num);
    goto &_fact2;
}
 
sub _fact2 {
    my ($result, $counter, $max) = @_;
    return $result if $counter > $max;
    @_ = (($result * $counter), ($counter + 1), $max);
    goto &_fact2;
}
 
sub fact3 {
    my ($num) = @_;
    @_ = (1, 1, $num);
    goto $_fact3;
}

__END__
1..3
ok 1 - fact1 should return the correct amount
ok 2 - fact2 should return the correct amount
ok 3 - fact3 should return the correct amount
Testing factorial of 30
120
Benchmark: timing 10000 iterations of lexical, recursion, typeglob...
   lexical:  4 wallclock secs ( 4.05 usr +  0.00 sys =  4.05 CPU) @ 24
+69.14/s (n=10000)
 recursion:  1 wallclock secs ( 1.45 usr +  0.00 sys =  1.45 CPU) @ 68
+96.55/s (n=10000)
  typeglob:  5 wallclock secs ( 4.28 usr +  0.00 sys =  4.28 CPU) @ 23
+36.45/s (n=10000)
[download]

Update: I realized that I did one thing different in the tests. If I (uselessly) duplicate the assignment to @_ in &_fact1, it's not much faster, but it's still faster.

1..3
ok 1 - fact1 should return the correct amount
ok 2 - fact2 should return the correct amount
ok 3 - fact3 should return the correct amount
Timing factorial of 30
Benchmark: timing 10000 iterations of lexical, recursion, typeglob...
   lexical:  4 wallclock secs ( 4.05 usr +  0.00 sys =  4.05 CPU) @ 24
+69.14/s (n=10000)
 recursion:  4 wallclock secs ( 3.87 usr +  0.00 sys =  3.87 CPU) @ 25
+83.98/s (n=10000)
  typeglob:  4 wallclock secs ( 4.28 usr +  0.00 sys =  4.28 CPU) @ 23
+36.45/s (n=10000)
[download]

Cheers,
Ovid

New address of my CGI Course.

Comment on Why is goto &sub slow? Select or Download Code

Replies are listed 'Best First'.
Re: Why is goto &sub slow? by broquaint (Abbot) on Mar 29, 2004 at 01:30 UTC
When you perform a `goto &sub` you're not just calling the function, but you're also subverting the stack by destroying the current frame and any changes made by `local` within that frame. That's probably not quite the whole story, but I'd say that'd be the major cause of the discrepancy. Also you're tests aren't quite accurate in that the standard subroutine calls don't populate `@_` whereas the `goto &sub` subroutines do. If you modify the tests you'll find the discrepancy will decrease¹. HTH `_________ broquaint` ¹ you'll find the test modified benchmark and results in the readmore below ... Read more... (3 kB)	[reply] [d/l]
Re: Re: Why is goto &sub slow? by Brutha (Friar) on Mar 29, 2004 at 07:43 UTC
First, thanks Ovid and broquaint. This is a well done example of Perl, benchmarking and test in a different context than make test. broquaint, if I understand Your example, there is a typo in sub _fact0, it should return `return _fact0(($result * $counter), ($counter + 1), $max);` [download] That makes recursion 1 twice as fast as recursion 2 And it came to pass that in time the Great God Om spake unto Brutha, the Chosen One: "Psst!" (Terry Pratchett, Small Gods)	[reply] [d/l]
Re: Why is goto &sub slow? by kvale (Monsignor) on Mar 29, 2004 at 01:37 UTC
tilly posed the same question on p5p. That `goto &sub` is slow is a known issue at perl5porters. And in a subsequent message , it was claimed that "the C<goto &NAME> construct is not for tail recursion, but is for fooling the C<caller> function. ". I think someone also claimed that `goto LABEL` does not have the speed penalty of `goto &sub`. A further message notes that some memory leaks associated with `goto &sub` are cleaned up in 5.8.1. This does not answer your question: why is `goto &sub` slow? I don't know a definitive answer, but I gather it takes time to rearrange the stack and clean up the lexical pad before jumping to the new routine. -Mark	[reply] [d/l] [select]
Re: Re: Why is goto &sub slow? by dragonchild (Archbishop) on Mar 29, 2004 at 01:44 UTC
That doesn't gibe. Here's the two sequences, as I understand them. Standard recursion: First call to func(). (Create first frame on the stack) Second call to func(). (Create second frame on the stack) Return from second call. (Destroy second frame on the stack) Return from first call. (Destory first frame on the stack) goto ⊂ First call to func(). (Create first frame on the stack) `goto ⊂` (Destory first frame, create first frame) Return from goto call. (Destory first frame) I count two creations and two destructions of a stack frame. What am I missing? Plus, since `goto ⊂` doesn't have such a large stack, it takes less memory and lookups take less time. Right? ------ We are the carpenters and bricklayers of the Information Age. Then there are Damian modules.... sigh* ... that's not about being less-lazy -- that's about being on some really good drugs -- you know, there is no spoon.* - flyingmoose	[reply] [d/l] [select]
Re: Re: Re: Why is goto &sub slow? by kvale (Monsignor) on Mar 29, 2004 at 02:26 UTC
From L-R's updated benchmark and from your benchmark, recursion is only 2-10 percent faster than `goto &sub`. As you say above, both recursion and `goto &sub` create and destroy two frames. So the results seem consistent to me. Unless you are refering to that 10 percent slowdown for `goto &sub`? I don't know what is happening there. -Mark	[reply] [d/l] [select]
Re: Why is goto &sub slow? by dragonchild (Archbishop) on Mar 29, 2004 at 01:36 UTC
What about: `{ my ($end, $curr, $result); my $_factorial; $_factorial = sub { return $result if $curr > $end; $result = $curr++; $_factorial->(); }; sub fact4 { ($end, $curr, $result) = (shift, 1, 1); $_factorial->(); } }` [download] My benchmarking indicates that my version is 30% faster than the standard recursion and 6 times faster than goto. `dragonchild: 4 wallclock secs ( 3.19 usr + 0.00 sys = 3.19 CPU) @ 1 +22695.46/s (n=392012) lexical: 3 wallclock secs ( 3.10 usr + 0.00 sys = 3.10 CPU) @ 21 +976.48/s (n=68215) recursion: 3 wallclock secs ( 3.08 usr + 0.00 sys = 3.08 CPU) @ 91 +202.59/s (n=281360) typeglob: 4 wallclock secs ( 3.14 usr + 0.00 sys = 3.14 CPU) @ 21 +142.77/s (n=66494)` [download] ------ We are the carpenters and bricklayers of the Information Age.* Then there are Damian modules.... sigh* ... that's not about being less-lazy -- that's about being on some really good drugs -- you know, there is no spoon.* - flyingmoose	[reply] [d/l] [select]
Re: Re: Why is goto &sub slow? by jdporter (Paladin) on Mar 29, 2004 at 22:53 UTC
But that's still full recursion, not tail call elimination, which is what `goto &sub` is. jdporter The 6th Rule of Perl Club is -- There is no Rule #6.	[reply] [d/l]
Re: Re: Re: Why is goto &sub slow? by dragonchild (Archbishop) on Mar 30, 2004 at 02:02 UTC
Ok. fine. `{ my ($end, $curr, $result); my $_factorial; $_factorial = sub { return $result if $curr > $end; $result = $curr++; goto &$_factorial; }; sub fact5 { ($end, $curr, $result) = (shift, 1, 1); $_factorial->(); } }` [download] And, the benchmarking goes something like: Timing factorial of 5 Benchmark: running dragonchild1, dragonchild2, lexical, recursion, typ +eglob for at least 3 CPU seconds... dragonchild1: 3 wallclock secs ( 3.19 usr + 0.00 sys = 3.19 CPU) @ +106369.76/s (n=339745) dragonchild2: 3 wallclock secs ( 3.05 usr + 0.00 sys = 3.05 CPU) @ +89930.56/s (n=274558) lexical: 3 wallclock secs ( 3.06 usr + 0.00 sys = 3.06 CPU) @ 29 +376.63/s (n=90010) recursion: 4 wallclock secs ( 3.19 usr + 0.00 sys = 3.19 CPU) @ 68 +035.79/s (n=216694) typeglob: 3 wallclock secs ( 3.01 usr + 0.00 sys = 3.01 CPU) @ 27 +140.10/s (n=81556) Rate typeglob lexical recursion dragonchild2 dr +agonchild1 typeglob 27140/s -- -8% -60% -70% + -74% lexical 29377/s 8% -- -57% -67% + -72% recursion 68036/s 151% 132% -- -24% + -36% dragonchild2 89931/s 231% 206% 32% -- + -15% dragonchild1 106370/s 292% 262% 56% 18% + -- [download] In other words, my tail-call is 90k/second, straight recursion is 70k/second, but my straight recursion is 105k/second. The regular tail-call are in the 25-30k/second. I lose about 10% going from recursion to tail-call. Now, when I shift from factorial(5) to factorial(15) and factorial(25), the benchmarking looks something like: Timing factorial of 15 Benchmark: running dragonchild1, dragonchild2, lexical, recursion, typ +eglob for at least 3 CPU seconds... dragonchild1: 3 wallclock secs ( 3.00 usr + 0.00 sys = 3.00 CPU) @ +47052.93/s (n=141347) dragonchild2: 3 wallclock secs ( 3.18 usr + 0.00 sys = 3.18 CPU) @ +40502.36/s (n=128595) lexical: 4 wallclock secs ( 3.24 usr + 0.00 sys = 3.24 CPU) @ 11 +222.80/s (n=36418) recursion: 3 wallclock secs ( 3.12 usr + 0.00 sys = 3.12 CPU) @ 26 +999.68/s (n=84104) typeglob: 3 wallclock secs ( 3.10 usr + 0.00 sys = 3.10 CPU) @ 10 +381.32/s (n=32234) Rate typeglob lexical recursion dragonchild2 dr +agonchild1 typeglob 10381/s -- -7% -62% -74% + -78% lexical 11223/s 8% -- -58% -72% + -76% recursion 27000/s 160% 141% -- -33% + -43% dragonchild2 40502/s 290% 261% 50% -- + -14% dragonchild1 47053/s 353% 319% 74% 16% + -- ------------------------------------ Timing factorial of 25 Benchmark: running dragonchild1, dragonchild2, lexical, recursion, typ +eglob for at least 3 CPU seconds... dragonchild1: 3 wallclock secs ( 3.01 usr + 0.00 sys = 3.01 CPU) @ +29099.20/s (n=87705) dragonchild2: 3 wallclock secs ( 3.12 usr + 0.00 sys = 3.12 CPU) @ +25319.04/s (n=79122) lexical: 4 wallclock secs ( 3.30 usr + 0.00 sys = 3.30 CPU) @ 68 +28.22/s (n=22499) recursion: 4 wallclock secs ( 3.23 usr + 0.00 sys = 3.23 CPU) @ 16 +312.62/s (n=52755) typeglob: 3 wallclock secs ( 3.23 usr + 0.00 sys = 3.23 CPU) @ 61 +14.10/s (n=19773) Rate typeglob lexical recursion dragonchild2 dr +agonchild1 typeglob 6114/s -- -10% -63% -76% + -79% lexical 6828/s 12% -- -58% -73% + -77% recursion 16313/s 167% 139% -- -36% + -44% dragonchild2 25319/s 314% 271% 55% -- + -13% dragonchild1 29099/s 376% 326% 78% 15% + -- [download] I have no idea why my recursion is still beating my tail-call. The ratio is mostly staying the same, at least between my versions. At 80, this is what happens: Timing factorial of 80 Benchmark: running dragonchild1, dragonchild2, lexical, recursion, typ +eglob for at least 3 CPU seconds... dragonchild1: 4 wallclock secs ( 3.18 usr + 0.00 sys = 3.18 CPU) @ +8834.80/s (n=28130) dragonchild2: 4 wallclock secs ( 3.17 usr + 0.00 sys = 3.17 CPU) @ +7694.39/s (n=24422) lexical: 3 wallclock secs ( 3.11 usr + 0.00 sys = 3.11 CPU) @ 20 +67.44/s (n=6438) recursion: 3 wallclock secs ( 3.25 usr + 0.00 sys = 3.25 CPU) @ 50 +47.46/s (n=16379) typeglob: 3 wallclock secs ( 3.32 usr + 0.00 sys = 3.32 CPU) @ 18 +79.40/s (n=6249) Rate typeglob lexical recursion dragonchild2 dr +agonchild1 typeglob 1879/s -- -9% -63% -76% + -79% lexical 2067/s 10% -- -59% -73% + -77% recursion 5047/s 169% 144% -- -34% + -43% dragonchild2 7694/s 309% 272% 52% -- + -13% dragonchild1 8835/s 370% 327% 75% 15% + -- [download] I'm not sure what all those numbers really mean, but there they are. :-) ------ We are the carpenters and bricklayers of the Information Age.* Then there are Damian modules.... sigh* ... that's not about being less-lazy -- that's about being on some really good drugs -- you know, there is no spoon.* - flyingmoose	[reply] [d/l] [select]
Re: Why is goto &sub slow? by biosysadmin (Deacon) on Mar 29, 2004 at 04:06 UTC
In the spirit of TIMTOWTDI, here's another option for people who are interested: Read more... (2 kB) It may not be exactly what you're talking about, but your thread made me think of it. :)	[reply] [d/l] [select]
Re: Re: Why is goto &sub slow? by Ovid (Cardinal) on Mar 29, 2004 at 14:52 UTC
I assume you realize that you listed a factorial table and not a fibonacci table? :) Cheers, Ovid New address of my CGI Course.	[reply]
Re: Re: Re: Why is goto &sub slow? by flyingmoose (Priest) on Mar 29, 2004 at 17:22 UTC
10+9+8+7+6+5+4+3+2+1 doesn't equal 2721600? Only if you are using base 10 math and boring Euclidean geometry. Think outside the box!	[reply]
Re: Why is goto &sub slow? by zentara (Cardinal) on Mar 29, 2004 at 17:40 UTC
Well it may be getting a little off-topic for your question, but how about how parrot and perl6 will do it? `#factorial.pasm print "The first 15 factorials are:\n" set I1, 0 set I2, 15 set I3, 1 REDO: inc I1 mul I3, I3, I1 print I3 print "\n" lt I1, I2, REDO DONE: end` [download] To run it, `parrot -o factorial.pbc factorial.pasm then parrot factorial.pbc` [download] The whole parrot assembly register setup is so cool, I think alot of Perl6 will be "Inline::Pasm", :-) I'm not really a human, but I play one on earth. flash japh	[reply] [d/l] [select]
Re: Why is goto &sub slow? by bunnyman (Hermit) on Mar 29, 2004 at 16:48 UTC
See node Re[4]: How's your Perl? and BrowserUk's reply to it. You might want to take a peek into the source code for pp_goto.	[reply]