simonm has asked for the wisdom of the Perl Monks concerning the following question:

Ever since jkeenan1 started an interesting benchmarking thread a few months ago, an idea has been rattling around in the back of my head, and I've finally decided to take a stab at a solution in hopes that it would quit bugging me.

The problem is that the standard Benchmark module doesn't isolate its test cases from one another. This means that the order in which the cases are run can influence the results, because side effects, either obvious or obscure, can accumulate and affect later tests.

Data in global variables is an obvious source of side effects; in the example below, the grep takes longer as more items are pushed onto the array, so the test functions that happen to run later are reported by Benchmark as slower:

use Benchmark qw( cmpthese );

our @global;    # shared state that grows across every timed call

cmpthese( 1000, {
    "test_1" => sub { push @global, scalar grep 1, @global },
    "test_2" => sub { push @global, scalar grep 1, @global },
    "test_3" => sub { push @global, scalar grep 1, @global },
} );

To address this, I created a module that overrides the normal behavior of Benchmark to run each piece of code to be timed in a separate forked process. Just use Benchmark::Forking and the above benchmark reports the "correct" conclusion that the three tests run at approximately the same speed.
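
Concretely, only the loading line changes (a sketch; cmpthese has to be requested by name, since Benchmark exports it only on demand):

use Benchmark::Forking qw( cmpthese );   # was: use Benchmark qw( cmpthese );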

Feedback would be very welcome.

package Benchmark::Forking;

use strict;
use Benchmark;
use vars qw( $VERSION $Enabled $RunLoop );

BEGIN {
    $VERSION = 0.9;
    $Enabled = 1;
    $RunLoop = \&Benchmark::runloop;   # save the original run loop
}

sub enable  { $Enabled = 1 }
sub disable { $Enabled = 0 }
sub enabled { $#_ ? $Enabled = $_[1] : $Enabled }

# "use Benchmark::Forking" switches forking on and passes Benchmark's usual
# exports through to the caller; "no Benchmark::Forking" switches it off.
# (Benchmark inherits import from Exporter, so resolve it with can() --
# a plain goto &Benchmark::import would die on an undefined subroutine.)
sub import {
    enable();
    shift;                     # drop our own package name ...
    unshift @_, 'Benchmark';   # ... so Exporter exports Benchmark's functions
    goto &{ Benchmark->can('import') };
}
sub unimport { disable() }

# Override Benchmark's internal run loop to time each case in its own process.
sub Benchmark::runloop {
    $Enabled or return &$RunLoop;
    # Two-arg open on '-|' forks implicitly: it returns the child's pid in
    # the parent and 0 in the child, with the child's STDOUT piped to us.
    my $pid = open( FORK, '-|' );
    die "Can't fork: $!" unless defined $pid;
    unless ( $pid ) {
        print join "\n", @{ &$RunLoop };   # child: run the real loop, print timings
        exit;
    }
    my @td = <FORK>;            # parent: wait for the child's figures
    close( FORK ) or die $!;
    bless \@td, 'Benchmark';    # rebless them as a Benchmark object
}

1;
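
The enable/disable hooks can also be flipped at run time, so one script can report both forked and unforked timings. A sketch (the case names and bodies here are made up for illustration):

use Benchmark::Forking qw( cmpthese );

my %cases = (
    case_a => sub { my @x = (1) x 100 },
    case_b => sub { my @x = map 1, 1 .. 100 },
);

cmpthese( 1000, \%cases );        # forked: each case timed in its own child

Benchmark::Forking->enabled(0);   # switch forking off at run time
cmpthese( 1000, \%cases );        # ordinary in-process Benchmark behavior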

Update: I removed the old POD from this post, but it's still available here. An updated version is now on CPAN.

Replies are listed 'Best First'.
Re: Forking Benchmarks?
by tachyon (Chancellor) on Sep 04, 2004 at 10:26 UTC

    One issue is that open( FORK, '-|') won't work on a number of operating systems, with MSWin32 at the top of the list. This could be a bug or a feature, depending on your outlook.
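
    One way to cope might be to fall back to ordinary in-process timing on such platforms (a hypothetical guard, not part of the module as posted):

        use Benchmark::Forking;
        Benchmark::Forking->enabled(0) if $^O eq 'MSWin32';   # no usable fork-open here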

    cheers

    tachyon

Re: Forking Benchmarks?
by Rhys (Pilgrim) on Sep 04, 2004 at 10:37 UTC
    I think it's worth mentioning in your POD that it's wisest to test code both with and without forking enabled, if your platform supports it. Enabling forking is a better test of each piece of code, but disabling forking is a better test of how they interact in a real script.

    One might even go so far as to say that differences between the two sets of results should throw up yellow flags. (Maybe expected, maybe not, but certainly the place where algorithm analysis should be focused.) The average scripter may need you to include this short summary of your original point. ;-)

    BTW, I love the title for this thread. Almost, but not quite, vulgar. :-D

      I don't see either point. You don't want your benchmarks to interact, and normally have to make sure they don't. Forking saves you that trouble. That also means yellow flags should be raised only if you wanted to use the non-forking benchmark as the baseline -- but why? Sure, if you find differences and didn't expect any, it's worth investigating the source of the interaction -- if it's not in your own benchmarked code, modules you pull in might have an issue you weren't aware of. But beyond that, provided with a means to entirely isolate benchmarks, I just don't see any reason to go to the trouble of making them "clean".

      Makeshifts last the longest.

        In the example given, there are two possible scenarios.

        1) The intention is to test how each version of a piece of code handles a specific problem. In this case, you're exactly right.

        2) The intention is to test how each piece of code performs in a larger program. In this case, both the performance of the individual segments and the interactions among those segments in the real-world case are valuable, so you want both the 'isolated' and 'non-isolated' results.

        In any event, I like the module. In case 1, it allows several very similar segments of code to be tested at once without interacting, regardless of whether the coder knows they would otherwise interact. (Another good habit, like 'use strict'.)

Re: Forking Benchmarks?
by graff (Chancellor) on Sep 04, 2004 at 16:09 UTC
    I was puzzled slightly by this bit in the pod:
    (Note that while each case runs in a separate process, all of the repetitions of any one case are run together.)
    Does that mean that the cases are run serially (as in the original Benchmark approach) -- that is, case #2 doesn't start until case #1 is finished?

    That might be a dumb question, because I don't have any experience using a two-arg open where the second arg is "-|" -- perldoc -f open doesn't discuss this usage directly, and I must confess I'm baffled as to how you are actually making it work here. You might consider including a brief explanation (either as pod or as "#" comments in the code). (Update: thanks for the correction, simonm -- I had missed that part of the docs earlier.)

    Anyway, I'm asking about serial-vs-concurrent execution because forking could allow multiple sub-processes to run simultaneously -- and even though they are independent processes, they will typically be sharing a single cpu, which means that the timing report that comes back could depend on the number of cases being tested.

    Also, if the jobs are concurrent, this would place constraints on what the test code can do in terms of file i/o -- the user would have to be careful to use different file names for each case (especially on output files, though using the same input file could throw off the stats due to low-level caching in the hardware or the OS).

    Minor nit -- when I read the post, the second paragraph under "DESCRIPTION" in the pod ended with "In some cases" -- either you meant to finish the sentence or you meant to delete those three words (but you didn't do it yet).

      Does that mean that the cases are run serially (as in the original Benchmark approach) -- that is, case #2 doesn't start until case #1 is finished?

      Correct, the cases are run serially, with the parent process waiting for the forked child to complete the timing cycle before it proceeds.

      I don't have any experience using a two-arg open where the second arg is "-|" -- perldoc -f open doesn't discuss this usage directly

      You may need to just look harder...

      If you open a pipe on the command '-', i.e., either '|-' or '-|' with 2-arguments (or 1-argument) form of open(), then there is an implicit fork done, and the return value of open is the pid of the child within the parent process, and 0 within the child process. (Use defined($pid) to determine whether the open was successful.) The filehandle behaves normally for the parent, but i/o to that filehandle is piped from/to the STDOUT/STDIN of the child process. In the child process the filehandle isn't opened-- i/o happens from/to the new STDOUT or STDIN.
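
      In other words, a minimal standalone sketch of the idiom (separate from the module; the strings are just placeholders):

        my $pid = open( my $fh, '-|' );             # implicit fork
        die "Can't fork: $!" unless defined $pid;
        if ( $pid == 0 ) {                          # child: STDOUT feeds the pipe
            print "data from the child\n";
            exit;
        }
        print "parent read: ", scalar <$fh>;        # parent: reads the child's STDOUT
        close $fh or die $!;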

      I'll update the comments to mention this, and fix the documentation nit you pointed out. Thanks for the feedback!

Re: Forking Benchmarks?
by pbeckingham (Parson) on Sep 04, 2004 at 17:06 UTC

    If the forked processes don't end simultaneously, won't the remaining processes get unnatural speed boosts, increasing divergence in the results?



    pbeckingham - typist, perishable vertebrate.

      Only in wallclock time, not in terms of user/system CPU time.

      Makeshifts last the longest.

      If the forked processes don't end simultaneously, won't the remaining processes get unnatural speed boosts, increasing divergence in the results?

      As noted above, the processes are run serially, so they do not compete for resources. I'll update the documentation to make this clear.

Re: Forking Benchmarks?
by qq (Hermit) on Sep 06, 2004 at 00:13 UTC

    Perhaps include a URL for the original PerlMonks node in the Thanks To section? It is an interesting thread.

    qq (hi there)