Re^3: profiling XS routines

but it does not work unless perl is the only process running, and the OS never schedules another task, and the OS does not run code outside perl (e.g. in kernel mode). rdtsc does not know about processes and threads in the OS. It just counts CPU clock cycles, including those spent in other applications and the kernel.

Whilst all of that is true, it is also true of pretty much every other timing method you want to name. There is no practical way of establishing exact timings with modern, deeply piped-lined, hyperthreading, multi-cored cpus running multi-tasking OSs.

With processors that perform out-of-order execution; and through hyperthreading, can interleave the instructions of 2 or more separate processes through the same core; cache-line refreshes that can stall a process for relative eons; OSs that can preempt and re-prioritise at will; compilers that re-order and/or eliminate code; there is no practical way of exactly instrumenting any single piece of code. Quite literally: No way!.

And even if there were--say by booting a highly simplified, single-tasking bootstrap kernal that just loads the CuT and times how long it takes to execute--the results would be so unrepresentative of any real-world usage as to be entirely meaningless. That's why when you read these thesis papers that optimise algorithms for maximum cache coherency and make huge claims for efficiencies, you should reach for the salt cellar.

As with any benchmarking or profiling technique that you apply on any multi-tasking OS; or any multi-cored or multi-threaded processor, the best you can do is:

set the affinity for the process;
(maybe, set the affinities of every other process to a different core.)
(maybe, disable power-saving features)
run multiple iterations of the CuT;
average the results.

Repeat the above after each code/algorithm/compile option change to obtain roughly comparible timings.

Mostly, so long as you don't do one set of timings whilst your virus scanner runs and you're listening to videos whilst playing Grand Turismo 5--and the other set whilst not--averaging an appropriate number of runs will give you a reasonable notion of whether changes you've made are moving things in the right direction or not.

That said, the three single biggest optimisations I've made in my own XS routines are: a) do as much work in each individual call as possible; b) pass any mildy sizable quantity of data by reference rather than through the stack--including largish scalars; c) if the same data is going to be passed to the XS code on multiple occasions, pack it once and pass by reference and only unpack it when you need to.

Indeed, second to a dear friends "Algorithm, algorithm, algorithm", the next best method of software optimisation is to do as little as possible! In Perl that means: don't (re)allocate memory; don't copy unnecessarially; don't make unnecessary sub/method calls; and don't convert between formats until you have to.

Those rules-of-thumb can be emphasised by running a benchmark of two essentially similar pieces of code:

sub sum1{ 
    my @numbers = @_; 
    my $total = 0; 
    for my $number ( @numbers ){ 
        $total = "$total" + "$number" 
    } 
    return $total 
};;

sub sum2{ 
    my $total = 0; 
    $total += $_ for @{ $_[0] }; 
    return $total 
};;

@n = 1 .. 1e6;;

cmpthese -1, { A=>q[ my $total=sum1( @n ); ], B=>q[ my $total=sum2( \@
+n ); ] };;

    Rate    A    B
A 1.03/s   -- -90%
B 10.8/s 949%   --
[download]

Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.

"Science is about questioning the status quo. Questioning authority".

In the absence of evidence, opinion is indistinguishable from prejudice.

RIP PCW It is as I've been saying!(Audio until 20090817)

Comment on Re^3: profiling XS routines Download Code

Replies are listed 'Best First'.
Re^4: profiling XS routines by syphilis (Archbishop) on Aug 30, 2009 at 01:34 UTC
b) pass any mildy sizable quantity of data by reference rather than through the stack--including largish scalars; Despite the plausibility of this, I find it makes very little difference (where XSubs are concerned) whether the data is passed by reference or via the stack. Here's a demo: use warnings; use Benchmark; use Inline C => Config => BUILD_NOISY => 1; use Inline C => <<'EOC'; /* These subs taken from * * "Extending and Embedding Perl" * * by Jenness and Cozens / double sum_as_ref(AV avref) { int i, len; double sum = 0; SV ** elem; len = av_len(avref) + 1; for(i = 0; i < len; i++) { elem = av_fetch(avref, i, 0); sum += (double)SvIV(elem); } return sum; } double sum_as_list(SV arg1, ...) { int i, len; double sum = 0; SV * elem; Inline_Stack_Vars; len = Inline_Stack_Items; for(i = 0; i < len; i++) { elem = Inline_Stack_Item(i); sum += (double)SvIV(elem); } return sum; } EOC our @x = (1 .. 2e6); our($list, $ref); timethese (1, { 'list' => '$list = sum_as_list(@x);', 'ref' => '$ref = sum_as_ref(\@x);', }); print "$list\n$ref\n"; [download] And the result: `Benchmark: timing 1 iterations of list, ref... list: -1 wallclock secs ( 0.14 usr + 0.00 sys = 0.14 CPU) @ 7 +.09/s (n=1 ) (warning: too few iterations for a reliable count) ref: 0 wallclock secs ( 0.14 usr + 0.00 sys = 0.14 CPU) @ 7 +.09/s (n=1 ) (warning: too few iterations for a reliable count) 2000001000000 2000001000000` [download] Cheers, Rob	[reply] [d/l] [select]
Re^5: profiling XS routines by BrowserUk (Patriarch) on Aug 30, 2009 at 02:58 UTC
Hm. I cannot verify this currently as I haven't managed to get Inline::C to work again since I installed 5.10.1--I'm being fraked by Sw...EU::MM (again!)--but I'm really skeptical of that benchmark, without that I can actually see anything wrong with it on visual inspection. What those figures says, is that it cost literally no time at all to allocate space for, copy, and push two million scalars? Sorry Rob, but I just do not believe that can be true. If I succeed in working out why EU::MM is suddenly calling a non-existant method `MM::_all_prereqs()`, then I'll attempt to refute it. Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error. "Science is about questioning the status quo. Questioning authority". In the absence of evidence, opinion is indistinguishable from prejudice. RIP PCW It is as I've been saying!(Audio until 20090817)	[reply] [d/l]
Re^5: profiling XS routines by BrowserUk (Patriarch) on Aug 30, 2009 at 04:40 UTC
Okay. I got I::C to work, and these are the results I get from running your (unmodified) benchmark here: C:\test>792136.pl Benchmark: timing 1 iterations of list, ref... list: 0 wallclock secs ( 0.05 usr + 0.00 sys = 0.05 CPU) @ 21 +.28/s (n=1) (warning: too few iterations for a reliable count) ref: 0 wallclock secs ( 0.03 usr + 0.00 sys = 0.03 CPU) @ 32 +.26/s (n=1) (warning: too few iterations for a reliable count) 2000001000000 2000001000000 C:\test>792136.pl Benchmark: timing 1 iterations of list, ref... list: 0 wallclock secs ( 0.05 usr + 0.00 sys = 0.05 CPU) @ 21 +.28/s (n=1) (warning: too few iterations for a reliable count) ref: 0 wallclock secs ( 0.03 usr + 0.00 sys = 0.03 CPU) @ 31 +.25/s (n=1) (warning: too few iterations for a reliable count) 2000001000000 2000001000000 C:\test>792136.pl Benchmark: timing 1 iterations of list, ref... list: 0 wallclock secs ( 0.05 usr + 0.00 sys = 0.05 CPU) @ 21 +.28/s (n=1) (warning: too few iterations for a reliable count) ref: 0 wallclock secs ( 0.03 usr + 0.00 sys = 0.03 CPU) @ 31 +.25/s (n=1) (warning: too few iterations for a reliable count) 2000001000000 2000001000000 [download] Which is consistant enough to confirm my intuition. I'm not sure why you are getting different results--but I urge you to look closely at them. BTW: Can you tell me where PERL_BITFIELD16 is typedef'd on your setup, because it was entirely missing as far as I can see on mine. Once I added: `#define OPCODE U16 #ifdef PERL_MAD # define MADPROP_IN_BASEOP MADPROP* op_madprop; #else # define MADPROP_IN_BASEOP #endif typedef unsigned int PERL_BITFIELD16; /// <<< I added this! typedef PERL_BITFIELD16 Optype;` [download] I::C started working again? Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error. "Science is about questioning the status quo. Questioning authority". In the absence of evidence, opinion is indistinguishable from prejudice. RIP PCW It is as I've been saying!(Audio until 20090817)	[reply] [d/l] [select]
Re^6: profiling XS routines by syphilis (Archbishop) on Aug 30, 2009 at 05:37 UTC
Can you tell me where PERL_BITFIELD16 is typedef'd on your setup For me, it's in lib/CORE/perl.h: `/* macros to define bit-fields in structs. / #ifndef PERL_BITFIELD8 # define PERL_BITFIELD8 unsigned #endif #ifndef PERL_BITFIELD16 # define PERL_BITFIELD16 unsigned #endif #ifndef PERL_BITFIELD32 # define PERL_BITFIELD32 unsigned #endif` [download] As for the timings, I've done such comparisons before on other machines and other operating systems - eg on my aging linux box (Pentium-III) running the same script (but with 10 iterations instead of 1) I get 0.42 for 'list' and '0.49' for 'ref'. Whenever I've run this sort of comparison, I've found very little difference. Given that you're measuring such miniscule amounts of time, it might be more meaningful if you were to run (say) 30 iterations instead of 1 - ie: `timethese (30, { 'list' => '$list = sum_as_list(@x);', 'ref' => '$ref = sum_as_ref(\@x);', });` [download] I don't know why this AMD64 (Vista64) timing is so slow in comparison to your box and my aging linux box .... didn't realise it was* so slow until now.... Cheers, Rob	[reply] [d/l] [select]