I've spent 2 days staring at and thinking about that benchmark, and it is really hard not to be horribly scathing about it.
But I don't want to do that because Coro is a very clever piece of code by an obviously very bright guy.
Still, the best conclusion I can draw from the benchmark is that if you do the same very strange and silly thing
with both iThreads and Coro, Coro will do it more quickly. Which is hardly a great recommendation.
It would be all too easy to draw other, even less flattering conclusions regarding the purpose of the benchmark.
To explain what I mean, let's examine what the benchmark code does. To this end, I'll use this version, which just
removes all the Coro code so that we can concentrate on what it actually does.
It starts $T threads. We'll come back to these shortly.
It then starts $T more threads, each of which goes into an *endless loop*:
while() {
Yes! I was baffled by that construct, but if you run perl -E"while(){ say 'Hi' }" it's an endless loop.
each iteration of which constructs two $N x $N matrices (he uses floats but no matter). So for $N = 2 they might look something like this:
@a = ( [ 1, 2 ], [ 3, 4 ] );
@b = ( [ 5, 6 ], [ 7, 8 ] );
It then pushes (*shared*) arrays containing *copies* of each combination of pairs of rows from these two matrices,
plus their original positions, onto a work queue:
$Q =
[ 0, 0, [ 1, 2 ], [ 5, 6 ] ]
[ 0, 1, [ 1, 2 ], [ 7, 8 ] ]
[ 1, 0, [ 3, 4 ], [ 5, 6 ] ]
[ 1, 1, [ 3, 4 ], [ 7, 8 ] ]
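To make that concrete, here's a hypothetical reconstruction (mine, not his exact code) of what that enqueue step amounts to: for every pairing of a row of A with a row of B, a *shared* array holding the two indices plus *copies* of both rows goes onto the work queue:

```perl
use strict;
use warnings;
use threads;
use threads::shared;
use Thread::Queue;

# The two 2x2 matrices from above.
my @a = ( [ 1, 2 ], [ 3, 4 ] );
my @b = ( [ 5, 6 ], [ 7, 8 ] );

my $Q = Thread::Queue->new;
for my $x ( 0 .. $#a ) {
    for my $y ( 0 .. $#b ) {
        # shared_clone() deep-copies the whole structure into shared
        # memory -- indices plus fresh copies of both rows.
        $Q->enqueue( shared_clone(
            [ $x, $y, [ @{ $a[$x] } ], [ @{ $b[$y] } ] ]
        ) );
    }
}
print $Q->pending, " work items for one 2x2 pair\n";   # prints "4 ..."
```

So even for a 2x2 case, every row gets copied twice over before any "work" happens.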
So, back to those first 4 threads he started.
They sit in governed, but also *endless* loops, reading from the work queue.
When they get one of those Q elements above,
they unpack the elements into local variables,
and then copy the contents of both subarrays (which are shared, but only ever processed by a single thread!)
into *local (anonymous array!?) copies*.
It then iterates the two local subarrays in parallel, summing their products.
It then constructs a local, unshared anonymous array containing the x & y from the original input, plus the sum of products.
It then shares that anonymous array (which empties it!),
before pushing it back on to a results queue.
Which means he is pushing empty shared anonymous arrays onto the results queue?
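Yes, that's exactly what it means. This is documented threads::shared behaviour, not a quirk: calling share() on an already-populated array or hash wipes its contents. A minimal demonstration:

```perl
use strict;
use warnings;
use threads;          # must be loaded first, or threads::shared is a no-op
use threads::shared;

my $result = [ 0, 1, 12.0 ];   # x, y, sum-of-products
share( $result );              # shares the referenced array...
print scalar @$result, "\n";   # ...and wipes it: prints 0
```

The correct idiom for sharing populated data is shared_clone(), which copies the contents into a new shared structure instead of destroying them.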
Now, finally, the main thread sits reading from the results queue, ignoring the results but counting them.
Then, after an (arbitrary) count of 63(!?), it starts a timer.
Then it continues counting and discarding results until the count satisfies this condition:
elsif (($count & 63) == 0) {
and if at that point, at least 5 seconds have elapsed(!?)
if (time > $t + 5) {
it prints out a number that is some manipulation of the time it ran, the size of the matrices and the count:
printf "%f\n", $count / ($N * $N * (time - $t));
last;
and exits with the 8 threads still running(!?).
None of this makes any sense whatsoever.
- He is not performing a matrix multiplication.
At the very least he would have to:
- Transpose matrix B;
- Actually put the results onto the results queue.
As is, he populates a non-shared anonymous array and then shares it,
which wipes the contents, meaning what he pushes onto the results queue is an empty array;
- Build a results matrix C from the results he calculates.
But those are the least of the problems.
- It doesn't make sense to split a matrix multiplication of two 50x50 matrices across multiple threads! Much less break the calculation down by rows.
A modern processor can complete the entire 2 * 50x50 multiplication (using the simplest naive algorithm), in much less than a single timeslice.
And far less time than it takes to:
- Spawn a thread;
- Copy the data to shared memory;
- Queue that shared data;
- Retrieve that data;
- And copy it to local variables.
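To put a number on that claim (a throwaway baseline of mine, not his code): even in pure Perl, with the simplest naive algorithm and no threading at all, the entire 50x50 multiplication completes in a handful of milliseconds on any modern box:

```perl
use strict;
use warnings;
use Time::HiRes qw( time );

my $N = 50;
my @A = map { [ map { rand } 1 .. $N ] } 1 .. $N;
my @B = map { [ map { rand } 1 .. $N ] } 1 .. $N;

my $t0 = time;
my @C;
for my $i ( 0 .. $N - 1 ) {
    for my $j ( 0 .. $N - 1 ) {
        my $sum = 0;
        $sum += $A[$i][$_] * $B[$_][$j] for 0 .. $N - 1;
        $C[$i][$j] = $sum;
    }
}
printf "Naive %dx%d multiply: %.1f ms\n", $N, $N, ( time - $t0 ) * 1000;
```

That's the whole job, done, before a single one of those queue round-trips would have completed.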
- And what's up with all that copying?
The whole point of shared data is that you can share it. There is no point in making data shared and then copying it to local variables to use it!
- And why do the threads continue pumping data onto the queues?
- Why does he never clean up the queues?
- Why does he discard the first 63 results?
- And what is that final value he is printing out?
To the very best of my ability to interpret it, it is as near to a random value as I can discern.
It is of absolutely no value whatsoever as a benchmark of threading!
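To underline the copying point above: nothing about threads::shared forces the copy-to-local dance. A worker thread can simply read a shared array in place, as this trivial sketch (mine) shows:

```perl
use strict;
use warnings;
use threads;
use threads::shared;

my @row : shared = ( 1, 2, 3, 4 );

my $sum = threads->create( sub {
    my $s = 0;
    $s += $_ for @row;     # read the shared array directly, no copies
    return $s;
} )->join;

print "$sum\n";            # prints 10
```

The only reason to copy shared data to locals is to shorten lock hold times under heavy contention, which is precisely the situation his design creates and then pays for.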
I'm glad that you noticed that Coro's main claim--Coro - the only real threads in perl--is bogus, because most people will not. iThreads are "real threads"; and all the more real because they can at least make use of multiple cores. Unlike (any flavour of) user-space cooperative threading.
Just like most of them will take the benchmark results at face value, without ever looking closely enough to see that they are equally bogus. If you break a piece of data into itty-bitty pieces; subject it to a bunch of unnecessary copying; farm it off to threads for manipulations that take far less than a single timeslice to process; copy it some more before queueing it back to the main thread to re-assemble--whilst all the time continuing to fire more and more data onto the queues, data that you are never going to process, but that will cause lock contention that slows everything down--then it will take longer than if you just used a single process. But that's not news!
It is perfectly possible to utilise multiple cores, via multiple threads, to improve the performance of matrix multiplication--provided the matrices involved are sufficiently large to actually warrant it.
But this is not the way to do it.
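For contrast, here is a minimal sketch of mine (not a fixed-up version of his code) of a saner decomposition: give each thread a contiguous band of A's rows and the whole of B *once*, let it compute its finished band of C, and collect each band exactly once via join. No shared state, no per-row queueing, no copying back and forth. Bear in mind that each iThread clones the interpreter, so this only pays off for genuinely large matrices; $N is kept small here purely for demonstration.

```perl
use strict;
use warnings;
use threads;

# Compute a band of C: every row in $rows times the whole of $B.
sub mult_band {
    my ( $rows, $B ) = @_;
    my @band;
    for my $a ( @$rows ) {
        push @band, [ map {
            my $j   = $_;
            my $sum = 0;
            $sum += $a->[$_] * $B->[$_][$j] for 0 .. $#$B;
            $sum;
        } 0 .. $#{ $B->[0] } ];
    }
    return \@band;
}

my $N = 8;   # tiny, just to demonstrate the shape of the decomposition
my @A = map { [ map { rand } 1 .. $N ] } 1 .. $N;
my @B = map { [ map { rand } 1 .. $N ] } 1 .. $N;

my $T   = 4;
my $per = int( $N / $T );
my @workers = map {
    my @rows = @A[ $_ * $per .. ( $_ + 1 ) * $per - 1 ];
    threads->create( \&mult_band, \@rows, \@B );
} 0 .. $T - 1;

# Reassemble C in row order as each worker finishes.
my @C = map { @{ $_->join } } @workers;
print scalar @C, " rows of C computed\n";
```

Each piece of data crosses a thread boundary exactly twice: once going in at create time, once coming back at join. That is the entire communication cost, instead of a per-row-pair queue round-trip.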
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.