in reply to Re^4: If I am tied to a db and I join a thread, program chrashes
in thread If I am tied to a db and I join a thread, program chrashes

After having had some time to think about what I wrote yesterday, I too realised that the conclusion one can draw from the matrix multiplication example is a different one: "Don't parallelize communication-heavy algorithms". Still, it shows that the sharing has its costs (as you too point out). One result from Marc's tests was that the Coro implementation running on one core was 25-35% faster under a perl built without ithread support. I.e. everyone with a standard perl installation is paying this price, whether they use threads or not.

It is (all but) impossible to parallelise matrix multiplication using "real processes" alone

I don't think so. If you can afford 4 times the memory usage a process implementation will be nearly 4 times faster on a quad-core. Every process uses its own non-shared memory to calculate one fourth of the rows.

Marc Lehmann achieves his results by using threads. Albeit that they are a user space implementation of cooperative threading, it is still threading ...

Now we've established that a form of threading is required!

Note that his Coro threads don't have any parallelization potential for multi-cores. They should run exactly the same on one core as on a quad-core. He could have used a simple non-threaded matrix multiplication and it would have run exactly the same. His Coro threads have two uses as I see it: making it easier to program independent tasks in a program, and speeding up programs where I/O, user input etc. provide opportunities for parallel tasks. They don't help with multi-cores at all.

This is the reason Marc Lehmann only used a single core to benchmark his Coro implementation. And therefore your conclusion is wrong. Threads are not required.

To answer your question about the code, his implementation of the benchmark can be found at http://data.plan9.de/matmult. Please check it out; I will try to do the same tonight when I have some time.


Re^6: If I am tied to a db and I join a thread, program chrashes
by BrowserUk (Patriarch) on Jun 08, 2009 at 14:03 UTC

I've spent 2 days staring at and thinking about that benchmark, and it is really hard not to be horribly scathing about it. But I don't want to be, because Coro is a very clever piece of code by an obviously very bright guy.

Still, the best conclusion I can draw from the benchmark is that if you do the same very strange and silly thing with both iThreads and Coro, Coro will do it more quickly. Which is hardly a great recommendation. It would be all too easy to draw other, even less flattering, conclusions regarding the purpose of the benchmark.

To explain what I mean, let's examine what the benchmark code does. To this end, I'll use this version, which just removes all the Coro code so that we can concentrate on what it actually does.

    It starts $T threads. We'll come back to these shortly.

    It then starts $T more threads, each of which goes into an *endless loop*:

    while() {
Yes! I was baffled by that construct, but if you run perl -E"while(){ say 'Hi' }" you'll see it's an endless loop.

each iteration of which constructs two $N x $N matrices (he uses floats, but no matter). So for $N = 2 they might look something like this:

@a = ( [ 1, 2 ], [ 3, 4 ] );
@b = ( [ 5, 6 ], [ 7, 8 ] );

It then pushes (*shared*) arrays containing *copies* of each combination of pairs of rows from these two matrices, plus their original positions, onto a work queue:

$Q = [ 0, 0, [ 1, 2 ], [ 5, 6 ] ],
     [ 0, 1, [ 1, 2 ], [ 7, 8 ] ],
     [ 1, 0, [ 3, 4 ], [ 5, 6 ] ],
     [ 1, 1, [ 3, 4 ], [ 7, 8 ] ]

    So, back to those first 4 threads he started.

    They sit in governed, but also *endless* loops, reading from the work queue.

When they get one of those Q elements above, they unpack it into local variables, and then copy the contents of both subarrays (which are shared, but only ever processed by a single thread!) into *local (anonymous array!?) copies*.

It then iterates the two local subarrays in parallel, summing their products. It then constructs a local, unshared anonymous array containing the x & y from the original input, plus the sum of products. It then shares that anonymous array (which empties it!), before pushing it onto a results queue.

    Which means he is pushing empty shared anonymous arrays onto the results queue?

Now finally, the main thread sits reading from the results queue, ignoring the results but counting them. Then, after an (arbitrary) count of 63(!?), it starts a timer. It then continues counting and discarding results until the count satisfies this condition:

    elsif (($count & 63) == 0) {
and if, at that point, at least 5 seconds have elapsed(!?)
    if (time > $t + 5) {

it prints out a number that is some manipulation of the running time, the size of the matrices and the count,

    printf "%f\n", $count / ($N * $N * (time - $t)); last;

    and exits with the 8 threads still running(!?).

    None of this makes any sense whatsoever.

    • He is not performing a matrix multiplication.

      At the very least he would have to

1. Transpose matrix B, so that rows of A are paired with columns of B,
      2. Actually put the results onto the results queue.

As it is, he populates a non-shared anonymous array and then shares it, which wipes the contents; meaning what he pushes onto the results queue is an empty array.

      3. Build a results matrix C from the results he calculates.

      But those are the least of the problems.

• It doesn't make sense to split a matrix multiplication of two 50x50 matrices across multiple threads! Much less to break the calculation down by rows.

A modern processor can complete the entire multiplication of two 50x50 matrices (using the simplest naive algorithm) in much less than a single timeslice.

And far less time than it takes to:

1. Spawn a thread;
2. Copy the data to shared memory;
3. Queue that shared data;
4. Retrieve that data;
5. And copy it to local variables.
    • And what's up with all that copying?

      The whole point of shared data is that you can share it. There is no point in making data shared and then copying it to local variables to use it!

    • And why do the threads continue pumping data onto the queues?
    • Why does he never clean up the queues?
    • Why does he discard the first 63 results?
    • And what is that final value he is printing out?

To the very best of my ability to interpret it, it is as near to a random value as I can discern.

      It is of absolutely no value whatsoever as a benchmark of threading!

I'm glad that you noticed that Coro's main claim--Coro - the only real threads in perl--is bogus, because most people will not. iThreads are "real threads"; and all the more real because they can at least make use of multiple cores, unlike any flavour of user-space cooperative threading.

Just as most of them will take the benchmark results at face value, without ever looking closely enough to see that they are equally bogus. If you break a piece of data into iddy-biddy pieces; subject it to a bunch of unnecessary copying; farm it off to threads for manipulations that take far less than a single timeslice to process; copy it some more before queueing it back to the main thread to re-assemble--whilst all the time continuing to fire more and more data onto the queues, data that you are never going to process, but that will cause lock states that slow everything down--then it will take longer than if you had just used a single process. But that's not news!

It is perfectly possible to utilise multiple cores, via multiple threads, to improve the performance of matrix multiplication--provided the matrices involved are sufficiently large to actually warrant it.

    But this is not the way to do it.


    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.

Finally I had time to look through the source too, and while the benchmark is probably hopelessly unfair to ithreads, does not work correctly with ithreads, and is totally artificial, the program at least makes a lot more sense to me now. It seems to be a synthetic benchmark of the Coro package (and as you will see, this explains a lot).

The Coro queues are limited to 512 entries. So after filling the first queue (which will happen if creating the matrices is faster than multiplying them, which I assume), the 4 producer threads just keep the queue topped up. Waiting a few rounds before timing is sensible, because the timing won't be stable until the queue is filled.

Measuring >5 seconds makes sense as well. You need a minimum run time to get enough accuracy in the time measurement, and since CPUs get faster over time you can't just use a fixed number of rounds.

Coro also seems to stop all coroutines when one of them calls exit(), so there is no need to stop the threads when the main program finishes.

The term $count / ($N * $N * (time - $t)) is how many multiplications (or matrix cells) per second the benchmark can process. Definitely not a random value.

It is a synthetic benchmark with random cell values, so if matrix B isn't transposed it really doesn't matter. Even the wiping of the result in the ithread version doesn't matter, though it probably wasn't meant that way: it is either a bug, or the wiping was simply unknown to the benchmark's author.

No question, this algorithm doesn't make sense as a real-world solution. As a synthetic benchmark of Coro it works; as a worst-case example for ithreads, probably too.

I don't know what the copying is about. It makes sense to unshare data if you do a lot of operations on it, because shared data is slower, but one multiplication isn't a lot. Maybe this was included to simulate some optimization.

Cleaning up of the queues is not necessary because (again) this isn't a real solution to a problem but a synthetic benchmark. The only result anyone is interested in is the time it takes.

        The term $count / ($N * $N * (time - $t)) is how many multiplications (or matrix cells) per second the benchmark can process.

        I thought that at first, but then I looked closer.

while( my $r = $qr->dequeue ) {
    ++$count;
    if ($count == 63) {
        $t = time;
    }
        1. The count starts immediately, but the timer doesn't start until the count hits 63.

          At the very least that inflates the benchmark values a little.

        2. But then, it only checks the time each time the count hits a multiple of 64:
  elsif (($count & 63) == 0) {
      if (time > $t + 5) {
          printf "%f\n", $count / ($N * $N * (time - $t));
          last;
      }
  }
  }

          Why would it do that? And the answer is, because it improves the performance of Coro!

As Coro threads are cooperative, if the timing thread called the relatively expensive time builtin on every iteration, the CPU used to process that opcode would directly detract from the time the other threads spent doing multiplications.

          That operation effectively reduces the number of expensive calls to time by a factor of 64. As iThreads are preemptive, they don't need or benefit from the reduction in the number of calls to time, making it a performance multiplier for the Coro threads only. Talk about weighting the die.

        I don't know what the copying is about. It makes sense to unshare data if you do a lot of operations on it because shared data is slower, but one multiplication isn't a lot. Maybe this was included to simulate some optimizations.

        Hm. Sorry, but that doesn't make any sense at all.

To multiply the 50 pairs of values means accessing each shared value once: 100 shared accesses. To copy those values to non-shared memory also means accessing each shared value once: 100 shared accesses. But additionally, you have to: a) allocate the non-shared arrays; b) write the values into that non-shared memory; c) access each non-shared value once to do the multiply.

        Ditto with the results. Instead of just writing them directly to shared memory, he 1) allocates non-shared; 2) writes to non-shared; 3) allocates shared; 4) copies non-shared to shared.

So, to avoid 100 'slow operations', he does: 3 * 100 allocations; 100 reads from shared (the very 'slow operations' he was trying to avoid!); 100 writes to non-shared; 200 reads from non-shared; and 100 writes to shared. Not so much of an optimisation.

And note: all of these shenanigans happen only on the iThreads side of the benchmark.

        Cleaning up of the queues is not necessary because (again) this isn't a real solution to a problem but a synthetic benchmark.

        But that (again) is a totally Coro-biased view of things.

        With 4 (preemptive) threads continually generating matrices--breaking them up into chunks (that will never be processed); sharing them and firing them into a shared queue (injecting sync points into all the threads that have visibility of that queue; ie. all of them)--they are directly competing for the CPUs with the 4 threads that are meant to be doing the work.

        Continually adding more and more data to the queue that's never going to be processed, means constantly reallocating and copying the underlying shared array, as the queue size doubles and redoubles in size. It's like timing how long it takes to stick labels on boxes whilst the operator is continuously having to unload and reload the boxes he's already done onto bigger and bigger lorries. The time taken to affix the labels is entirely lost in the noise of everything else he is having to do.

        And again, the bias is all in favour of Coro, because with limited size queues the Coro generating threads will have blocked long before he ever starts timing. Ie. 63 * 50 = 3,150 > 512

        The only result anyone is interested in is the time it takes.

        In that case, I offer:

perl -E"$s=time(); $i=0; ++$c, rand()*rand() while time() < $s+5; say +$c/5"
3210038.8

        Let's see Coro compete with that! It's just as (un)fair a comparison as this benchmark.

