in reply to Re^2: If I am tied to a db and I join a thread, program chrashes
in thread If I am tied to a db and I join a thread, program chrashes

The test was done by Marc Lehmann, who presented his results at the German Perl Workshop this year. Sadly, the talk is not available online, so I had to cite from memory when I answered. I have it in front of me now and can translate some points for you:

1) Perl's interpreter threads are a sort of backport of ActiveState's implementation of fork through Windows threads. The whole Perl interpreter gets copied, with all variables. Every function internally gets an additional parameter that tells Perl where to find its variables (I guess he means for synchronising). This makes Perl slower even if you don't use threads, makes it unstable, and doesn't work well with XS modules. There is no common address space, so you don't get any of the advantages of threads but still have to pay the price of the synchronisation.

2) Threads don't work well on multi-core systems because every CPU has its own MMU and cache. Because threads share resources, all MMUs and caches have to be synchronised often. For example, if a thread allocates memory, every CPU has to be halted and its state synchronised. Perl's thread implementation doesn't do that (see above), but pays instead with the additional indirection on every variable access, which costs 15 to 200% compared to a Perl built without thread support (even when not using threads).

3) Marc did tests with a matrix multiplication (selected because it involves a lot of variable sharing). Slowest was the version with 4 interpreter threads on a quad-core machine. The same 4 interpreter threads on a single core(!) were 20 times faster. An implementation using cooperative/non-preemptive threads (Coro, written by Marc Lehmann) on a single core was 300 times faster than the interpreter threads.

To answer your question 4 now: Perl's interpreter threads seem not to work well on multi-cores in those cases where they actually make extensive use of shared variables (that is, if Marc's results are not faked, fabricated or erroneous). You can read some of his points in his documentation for Coro if you are interested.


Re^4: If I am tied to a db and I join a thread, program chrashes
by BrowserUk (Patriarch) on Jun 05, 2009 at 04:21 UTC

    Firstly, thank you for your prompt and detailed response.

    Secondly, your sweeping generalisation, "Better use real processes", is incorrect--even if everything Marc Lehmann said in his talk is 100% accurate. It is (all but) impossible to parallelise matrix multiplication using "real processes" alone.

    Marc Lehmann achieves his results by using threads. Albeit that they are a user-space implementation of cooperative threading, it is still threading. The choice for the parallelisation of Perl programs across multiple cores is not between 'using threads' and 'using processes'; it is between using the ithreads implementation or the Coro threads implementation.

    Now we've established that a form of threading is required!

    Let's discuss the merits of the two implementations. I'm not a fan of the iThreads implementation. The attempt to emulate fork on windows is mostly unusable, and the artifacts that emulation attempt imposes upon the use of threading are costly and frustrating. But removing them at this stage is not an option, so it is a case of working within the limitations of what we have. The same can be said about many other areas of Perl. And if you work within those limitations, iThreads are

    1. Available out-of-the-box everywhere.

      Everywhere that hasn't explicitly chosen to eschew them that is.

    2. A simple API. All the standard Perl functionality 'just works'.

      You don't need special implementations of: IO, select, timers, LWP, Storable et al.

    3. Very easy to use.

      For a whole raft of 'let me do something else whilst this piece of code runs' applications.

    4. Easier to develop and test than the alternatives (by a country mile!).

      This is especially true for casual multi-taskers who don't want to get into the nitty-gritty of Amdahl's Law, much less its limitations as addressed by Gustafson's Law.

      They want to be able to write the programs just as they always have for single tasking; and then run either several copies of one algorithm, or several algorithms concurrently to make use of the multiple cores that are now ubiquitous. And iThreads allows them to do that. Today, out-of-the-box with no need to learn complex specialist programming techniques to break up and interleave their algorithms into iddy-biddy chunks.

      They do not care whether they get 2x or only 1.75x performance from a dual core; or only 2.5 rather than 3x on a triple core; or only just 3x on a quad core. What they do care about is that whatever number of cores their code finds itself running on, they will get an appropriate benefit from them, without having to go through a complex tuning process for each and every cpu type.
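    The out-of-the-box usage described above can be sketched in a dozen lines. This is my own illustration, not code from the thread; sub sum_chunk is hypothetical filler standing in for any CPU-bound routine:

```perl
use strict;
use warnings;
use threads;

sub sum_chunk {                 # any CPU-bound routine will do
    my ($from, $to) = @_;
    my $s = 0;
    $s += $_ for $from .. $to;
    return $s;
}

# four threads, each summing a quarter of 1 .. 1_000_000
my @t = map {
    threads->create( { context => 'scalar' }, \&sum_chunk,
                     $_ * 250_000 + 1, ( $_ + 1 ) * 250_000 )
} 0 .. 3;

my $total = 0;
$total += $_->join for @t;
print "$total\n";               # 500000500000
```

    Written just as one would for single tasking, then split four ways; on a quad-core the four sums proceed on separate cores with no shared state at all.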

    Coro only wins (perhaps[1]), on one particular class of parallelisation tasks. That of cpu-intensive algorithms running over promiscuously shared data. But this is only one class of parallelisation task, and a relatively rare one at that. And then only if the programmer is well versed in the needs and vagaries of tuning user-space cooperative threading. And that is no simple task as anyone who used or programmed Windows '95 will tell you!

    The example given is that of matrix multiplication, and that possibly gives an indication of why Marc Lehmann's benchmark apparently shows iThreads in such a bad light. There is no need for promiscuously shared data (and the associated locking) with matrix multiplication! So if Marc's iThreads MM implementation does the naive thing of applying locks & syncing to all three arrays, then it is no wonder that it runs slowly. But that is user error!

    [1]: I've done a brief search of both the Coro package and the web in general, and I have been unable to locate Marc Lehmann's benchmark code (despite it being the basis of the package's primary 'claim to fame'). So I've been unable to verify my speculation about it. If the code is available anywhere, I would be happy to review it and correct my speculations if they turn out to be unfounded!

    But in the end, anyone doing serious matrix manipulations where ultimate performance is the primary requirement, probably isn't going to be using Perl! And if they are, simply dropping into PDL to do those manipulations will probably gain them far more performance than hand tuning a Coro implementation.


    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.

      After having some time to think about what I wrote yesterday, I too realized that the conclusion one can draw from the matrix multiplication example is different: "Don't parallelize communication-heavy algorithms". Still, it shows that the sharing has its costs (as you too point out). One result from Marc's tests was that the Coro implementation running on one core was 25-35% faster with a Perl built without ithread support. That is, everyone with a standard Perl installation is paying this price, with or without using threads.

      It is (all but) impossible to parallelise matrix multiplication using "real processes" alone

      I don't think so. If you can afford 4 times the memory usage a process implementation will be nearly 4 times faster on a quad-core. Every process uses its own non-shared memory to calculate one fourth of the rows.
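      To make the "real processes" claim concrete, here is a minimal sketch (my own, not Marc Lehmann's code): four forked children each multiply their own interleaved set of rows using private copies of the matrices, and pipe the finished rows back to the parent via Storable:

```perl
use strict;
use warnings;
use Storable qw(freeze thaw);

my $N = 8;
my @a = map { [ map { $_ + 1 } 0 .. $N - 1 ] } 1 .. $N;   # every row: 1..8
my @b = map { [ map { $_ % 3 } 0 .. $N - 1 ] } 1 .. $N;   # every row: 0,1,2,0,1,2,0,1

my @c;
my @kids;
for my $p ( 0 .. 3 ) {
    pipe( my $r, my $w ) or die "pipe: $!";
    my $pid = fork;
    die "fork: $!" unless defined $pid;
    if ( $pid == 0 ) {              # child: every 4th row, starting at row $p
        close $r;
        my @rows;
        for my $i ( grep { $_ % 4 == $p } 0 .. $N - 1 ) {
            my @row;
            for my $j ( 0 .. $N - 1 ) {
                my $s = 0;
                $s += $a[$i][$_] * $b[$_][$j] for 0 .. $N - 1;
                push @row, $s;
            }
            push @rows, [ $i, @row ];  # tag each row with its index
        }
        print {$w} freeze( \@rows );   # send results back over the pipe
        close $w;
        exit 0;
    }
    close $w;                          # parent keeps only the read end
    push @kids, [ $pid, $r ];
}

for my $kid (@kids) {
    my ( $pid, $r ) = @$kid;
    my $blob = do { local $/; <$r> };  # slurp the frozen block of rows
    close $r;
    for my $row ( @{ thaw($blob) } ) {
        my ( $i, @vals ) = @$row;
        $c[$i] = \@vals;
    }
    waitpid $pid, 0;
}
print "$c[0][1] $c[3][2]\n";           # 36 72
```

      No shared memory anywhere; the price paid is the extra memory for the per-process copies and the serialisation over the pipes.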

      Marc Lehmann achieves his results by using threads. Albeit that they are a user space implementation of cooperative threading, it is still threading ...

      Now we've established that a form of threading is required!

      Note that his Coro threads don't have any parallelization potential for multi-cores. They should run exactly the same on one core as on a quad-core. He could have used a simple non-threaded matrix multiplication and it would have run exactly the same. His Coro threads have two uses as I see it: making it easier to program independent tasks in a program, and speeding up programs where I/O, user input etc. provide opportunities for parallel tasks. They don't help with multi-cores at all.

      This is the reason Marc Lehmann only used a single core to benchmark his Coro implementation. And therefore your conclusion is wrong. Threads are not required.

      To answer your question about the code, his implementation of the benchmark can be found at http://data.plan9.de/matmult. Please check it out, I will try to do it tonight when I have some time.

        I've spent 2 days staring at and thinking about that benchmark, and it is really hard not to be horribly scathing about it. But I don't want to do that, because Coro is a very clever piece of code by an obviously very bright guy.

        Still, the best conclusion I can draw from the benchmark is that if you do the same very strange and silly thing with both iThreads and Coro, Coro will do it more quickly. Which is hardly a great recommendation. It would be all too easy to draw other, even less flattering conclusions regarding the purpose of the benchmark.

        To explain what I mean, let's examine what the benchmark code does. To this end, I'll use this version, which just removes all the Coro code so that we can concentrate on what it actually does.

        It starts $T threads. We'll come back to these shortly.

        It then starts $T more threads, each of which goes into an *endless loop*:

        while() {
        Yes! I was baffled by that construct, but if you run perl -E"while(){ say 'Hi' }" it's an endless loop.

        each iteration of which constructs two $N x $N matrices (he uses floats but no matter). So for $N = 2 they might look something like this:

        @a = ( [ 1, 2 ], [ 3, 4 ] )
        @b = ( [ 5, 6 ], [ 7, 8 ] )

        It then pushes (*shared*) arrays containing *copies* of each combination of pairs of rows from these two matrices, plus their original positions, onto a work queue.

        $Q = [ 0, 0, [ 1, 2 ], [ 5, 6 ] ],
             [ 0, 1, [ 1, 2 ], [ 7, 8 ] ],
             [ 1, 0, [ 3, 4 ], [ 5, 6 ] ],
             [ 1, 1, [ 3, 4 ], [ 7, 8 ] ]
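        The enqueue pattern just described could be sketched like this (my reconstruction, not the benchmark's actual code); shared_clone from threads::shared produces the shared copies:

```perl
use strict;
use warnings;
use threads;
use threads::shared;
use Thread::Queue;

my @a = ( [ 1, 2 ], [ 3, 4 ] );
my @b = ( [ 5, 6 ], [ 7, 8 ] );

my $Q = Thread::Queue->new;
for my $i ( 0 .. $#a ) {
    for my $j ( 0 .. $#b ) {
        # each work item: indices plus *copies* of one row of each matrix,
        # deep-copied into shared arrays by shared_clone
        $Q->enqueue(
            shared_clone( [ $i, $j, [ @{ $a[$i] } ], [ @{ $b[$j] } ] ] ) );
    }
}
print $Q->pending, " work items queued\n";   # 4 work items queued
```

        Note the double copying already visible here: the rows are copied into fresh anonymous arrays, then copied again into shared storage.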

        So, back to those first 4 threads he started.

        They sit in governed, but also *endless* loops, reading from the work queue.

        When they get one of those Q elements above, they unpack the elements into local variables, and then copy the contents of both the subarrays (which are shared, but only ever processed by a single thread!) into *local (anonymous array!?) copies*.

        It then iterates the two local subarrays in parallel, summing their products. It then constructs a local unshared anonymous array containing the x & y from the original input, plus the sum of products. It then shares that anonymous array (which empties it!), before pushing it onto a results queue.

        Which means he is pushing empty shared anonymous arrays onto the results queue?

        Now finally, the main thread sits reading from the results queue, ignoring the results but counting them. Then after an (arbitrary) count of 63(!?), it starts a timer. Then it continues counting and discarding results until the count satisfies this condition:

        elsif (($count & 63) == 0) {
        and if at that point, at least 5 seconds have elapsed(!?)
        if (time > $t + 5) {

        it prints out a number that is some manipulation of the time it ran, the size of the matrices and the count,

        printf "%f\n", $count / ($N * $N * (time - $t)); last;

        and exits with the 8 threads still running(!?).

        None of this makes any sense whatsoever.

        • He is not performing a matrix multiplication.

          At the very least he would have to

          1. Transpose matrix B,
          2. Actually put the results onto the results queue.

            As is, he populates a non-shared anonymous array and then shares it, which wipes the contents, meaning what he pushes into the results queue is an empty array.

          3. Build a results matrix C from the results he calculates.

          But those are the least of the problems.

        • It doesn't make sense to split a matrix multiplication of two 50x50 matrices across multiple threads! Much less break the calculation down by rows.

          A modern processor can complete the entire 2 * 50x50 multiplication (using the simplest naive algorithm), in much less than a single timeslice.

          And far less time than it takes to:

          1. Spawn a thread;
          2. Copy the data to shared memory;
          3. Queue that shared data;
          4. Retrieve that data;
          5. And copy it to local variables.
        • And what's up with all that copying?

          The whole point of shared data is that you can share it. There is no point in making data shared and then copying it to local variables to use it!

        • And why do the threads continue pumping data onto the queues?
        • Why does he never clean up the queues?
        • Why does he discard the first 63 results?
        • And what is that final value he is printing out?

          To the very best of my abilities to interpret, it is as near to a random value as I can discern.

          It is of absolutely no value whatsoever as a benchmark of threading!
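        The "sharing empties the container" behaviour underlying the results-queue complaint above can be demonstrated in a few lines (assuming a threads-enabled perl):

```perl
use strict;
use warnings;
use threads;
use threads::shared;

my @results = ( 0, 1, 42 );     # populated *before* sharing
share(@results);                # share() discards existing contents
print scalar(@results), "\n";   # 0 -- an empty shared array
```

        The documented-safe pattern is to declare the variable shared first (my @results :shared;) and only then populate it.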

        I'm glad that you noticed that Coro's main claim--Coro - the only real threads in perl--is bogus, because most people will not. iThreads are "real threads"; and all the more real because they can at least make use of multiple cores. Unlike any flavour of user-space cooperative threading.

        Just like most of them will take the benchmark results at face value, without ever looking closely enough to see that they are equally bogus. If you break a piece of data into iddy-biddy pieces; subject it to a bunch of unnecessary copying; farm it off to threads for manipulations that take far less than a single timeslice to process; copy it some more before queuing it back to the main thread to re-assemble--whilst all the time continuing to fire more and more data onto the queues, data you are never going to process but that will cause lock states that slow everything down--then it will take longer than if you just used a single process. But that's not news!

        It is perfectly possible to utilise multiple cores, via multiple threads, to improve the performance of matrix multiplication--provided the matrices involved are sufficiently large to actually warrant it.

        But this is not the way to do it.
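        For the record, a sketch of one sane way to do it (my own illustration, not a tuned implementation): give each thread a contiguous block of rows, let ithreads' interpreter cloning hand it private read-only copies of both matrices, and collect the finished rows via join--no shared variables, no locking, no per-element queuing:

```perl
use strict;
use warnings;
use threads;

my $N = 64;
my @a = map { my $i = $_; [ map { $i + $_ }     0 .. $N - 1 ] } 0 .. $N - 1;
my @b = map { my $i = $_; [ map { $i * $_ % 7 } 0 .. $N - 1 ] } 0 .. $N - 1;

sub mult_rows {
    my ( $lo, $hi ) = @_;       # this thread's block of rows of @a
    my @out;
    for my $i ( $lo .. $hi ) {
        for my $j ( 0 .. $N - 1 ) {
            my $s = 0;
            $s += $a[$i][$_] * $b[$_][$j] for 0 .. $N - 1;
            $out[ $i - $lo ][$j] = $s;
        }
    }
    return \@out;               # one block of finished rows
}

my $T    = 4;
my $step = $N / $T;
my @thr  = map {
    threads->create( { context => 'scalar' }, \&mult_rows,
                     $_ * $step, ( $_ + 1 ) * $step - 1 )
} 0 .. $T - 1;

my @c;
push @c, @{ $_->join } for @thr;    # reassemble C in row order
print scalar(@c), " rows computed\n";
```

        Each thread gets its own cloned copies of @a and @b at spawn time, so there is nothing to lock; the only communication is the one return value per thread.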


Re^4: If I am tied to a db and I join a thread, program chrashes
by lance0r (Novice) on Jun 05, 2009 at 06:42 UTC
    Esteemed jethro: Well, since I am not using threads::shared, that probably explains why I do get a speed-up. BUT nothing like 300 times. I am basically doing lots of vector dot products. On my test code, running with one thread takes 10 sec; with 4 threads it takes 2 sec, but my CPU usage is only up to 50%, so I might be able to go faster.

    For matrix multiplication I don't think I could ever beat the times of Math::GSL::Blas, which has to be about the most optimized routines ever. So it sort of makes sense to me that trying to speed it up with shared data and threads wouldn't work.

    Unfortunately, to get my full matrices in memory I would need more than the 8 gig available, hence my plan to break up the data, do the dots with hashes (which gets rid of the zeros), and do them in threads. Maybe not optimal, but it should get my .cgi scripts down to an acceptable time, hopefully.
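    For what it's worth, the hash trick described above might look something like this (hash_dot is a hypothetical stand-in, not the thread's hashValueDot): store only the nonzero components keyed by index, so the dot product touches only indices present in both vectors:

```perl
use strict;
use warnings;

sub hash_dot {
    my ( $x, $y ) = @_;
    # iterate over the smaller hash for speed
    ( $x, $y ) = ( $y, $x ) if keys %$y < keys %$x;
    my $dot = 0;
    for my $i ( keys %$x ) {
        $dot += $x->{$i} * $y->{$i} if exists $y->{$i};
    }
    return $dot;
}

# sparse vectors: index => value, zeros simply omitted
my %u = ( 0 => 2, 5 => 3, 1_000_000 => 1 );
my %v = ( 5 => 4, 7 => 9 );
print hash_dot( \%u, \%v ), "\n";   # 12 (only index 5 overlaps: 3 * 4)
```

    The memory and time cost scale with the number of nonzero entries rather than the full vector length, which is exactly what makes the 8-gig constraint workable.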

    Thanks for the Lehmann stuff, it was interesting and helpful.

      Well, as you can see from the ensuing discussion, my first post was really overdramatizing what I had heard, and I don't know as much about the topic as I should.

      Obviously the ~300x penalty for using ithreads only comes into play with heavy sharing of variables, which I was wrongly assuming to be happening with the tied hashes of MLDBM.

      Instead, in your case (with the disk-based data) there are a lot of IO waits that can be used by other threads/processes to continue, and this is ideal for parallelization. So I would be very interested to know whether your speedup really comes from the quad-core. If I'm correct with my guess, your program with 4 threads would run nearly as fast or even faster with one core than with four (isn't there a way to turn off cores in the BIOS or somewhere?). And if that really is the case, there might be another speedup possible with Marc's Coro package.

        Esteemed jethro, I appreciate your thoughts. I will look at the Coro package. Meanwhile, I have no IO wait because the .db file is cached in memory from an earlier run. But I took a closer look at what was going on. I ran a test code with no threads and it takes 94 sec. My KDE system monitor is now set up to show CPU usage for each core separately, with usr, sys and wait for each. With no threads, only one CPU is used, at 85% usr. With 4 threads the code takes 23 sec, and shows 85% usr on each of the 4 CPUs. It's hard to avoid the conclusion that the threads are helping. Note the CPU usr time is 71 seconds, showing the code is in fact running in parallel. I bet there is a much better way to do this. What I think I will do is just buy a new motherboard and 16 gig of memory, then just multiply a matrix times a vector and be done. But I have learned something, especially with your and esteemed clinton's help. Thanks.

        I attach the two test programs with output, and I hope you will excuse the crude Perl; I am only learning and I want to keep things clear so I don't get confused.

        #! /usr/bin/perl -w
        use strict;
        use threads;
        use MLDBM::Sync;
        use MLDBM qw(DB_File Storable);
        use Fcntl;
        use Benchmark;

        require '/home/silly/g/prep/Prep.pl';

        my $file = "/home/silly/g/data/117373HOHsynTriNormCV.db";
        my %syntrihash;
        tie %syntrihash, 'MLDBM::Sync', $file, O_RDONLY or die "tie failed for db $!\n";

        my $NumbOfKeys = keys %syntrihash;
        print "Number of Dots to compute: $NumbOfKeys\n";

        my $t0 = new Benchmark;
        foreach my $synset (keys %syntrihash){
            my $hashref = \%{$syntrihash{$synset}};
            # for testing i just dot the hash with itself
            &hashValueDot($hashref, $hashref, my $dot);
        }
        untie %syntrihash;
        my $t1 = new Benchmark;
        my $td1 = timediff($t1, $t0);
        print STDERR "the code took:", timestr($td1, 'all'), " to do dots\n";
        silly@bluetit:~/perl/threads$ nothread.pl
        Number of Dots to compute: 117369
        the code took:94 wallclock secs (80.81 usr 13.03 sys + 0.00 cusr 0.00 csys = 93.84 CPU) to do dots
        Now with 4 threads:
        silly@bluetit:~/perl/threads$ cat thread.pl
        #! /usr/bin/perl -w
        use strict;
        use threads;
        use MLDBM::Sync;
        use MLDBM qw(DB_File Storable);
        use Fcntl;
        use Benchmark;

        my $t0 = new Benchmark;
        require '/home/silly/g/prep/Prep.pl';

        my $file = "/home/silly/g/data/117373HOHsynTriNormCV.db";
        my %syntrihash;
        tie %syntrihash, 'MLDBM::Sync', $file, O_RDONLY or die "tie failed for db $!\n";

        my $NumbOfKeys = keys %syntrihash;
        print "Number of Dots to compute: $NumbOfKeys\n";

        my $EndSection1 = int($NumbOfKeys/4);
        my $EndSection2 = $EndSection1 + int($NumbOfKeys/4);
        my $EndSection3 = $EndSection2 + int($NumbOfKeys/4);
        my @synsetPart1;
        my @synsetPart2;
        my @synsetPart3;
        my @synsetPart4;
        my $k = 0;
        foreach my $ss (keys %syntrihash){
            if    ($k < $EndSection1){ $synsetPart1[$k] = $ss; }
            elsif ($k < $EndSection2){ $synsetPart2[$k - $EndSection1] = $ss; }
            elsif ($k < $EndSection3){ $synsetPart3[$k - $EndSection2] = $ss; }
            else                     { $synsetPart4[$k - $EndSection3] = $ss; }
            $k += 1;
        }
        untie %syntrihash;

        my $t1 = new Benchmark;
        my $td1 = timediff($t1, $t0);
        print STDERR "the code took:", timestr($td1, 'all'), " to prep\n";

        $t0 = new Benchmark;
        my $thr1 = threads->create({'context' => 'array'}, \&subthr1, "test1");
        my $thr2 = threads->create({'context' => 'array'}, \&subthr2, "test2");
        my $thr3 = threads->create({'context' => 'array'}, \&subthr3, "test3");
        my $thr4 = threads->create({'context' => 'array'}, \&subthr4, "test4");
        my %return1 = $thr1->join();
        my %return2 = $thr2->join();
        my %return3 = $thr3->join();
        my %return4 = $thr4->join();
        my %synsetDotHash;
        $t1 = new Benchmark;
        $td1 = timediff($t1, $t0);
        print STDERR "the code took:", timestr($td1, 'all'), " to do threaded dots\n";

        sub subthr1{
            tie %syntrihash, 'MLDBM::Sync', $file, O_RDONLY or die "tie failed for db $!\n";
            foreach my $synset (@synsetPart1){
                my $hashref = \%{$syntrihash{$synset}};
                &hashValueDot($hashref, $hashref, my $dot);
                $synsetDotHash{$synset} = $dot;
            }
            my ($message) = @_;
            print "Thread Message is $message\n";
            return (%synsetDotHash);
        }

        sub subthr2{
            tie %syntrihash, 'MLDBM::Sync', $file, O_RDONLY or die "tie failed for db $!\n";
            foreach my $synset (@synsetPart2){
                my $hashref = \%{$syntrihash{$synset}};
                &hashValueDot($hashref, $hashref, my $dot);
                $synsetDotHash{$synset} = $dot;
            }
            my ($message) = @_;
            print "Thread Message is $message\n";
            return (%synsetDotHash);
        }

        sub subthr3{
            tie %syntrihash, 'MLDBM::Sync', $file, O_RDONLY or die "tie failed for db $!\n";
            foreach my $synset (@synsetPart3){
                my $hashref = \%{$syntrihash{$synset}};
                &hashValueDot($hashref, $hashref, my $dot);
                $synsetDotHash{$synset} = $dot;
            }
            my ($message) = @_;
            print "Thread Message is $message\n";
            return (%synsetDotHash);
        }

        sub subthr4{
            tie %syntrihash, 'MLDBM::Sync', $file, O_RDONLY or die "tie failed for db $!\n";
            foreach my $synset (@synsetPart4){
                my $hashref = \%{$syntrihash{$synset}};
                &hashValueDot($hashref, $hashref, my $dot);
                $synsetDotHash{$synset} = $dot;
            }
            my ($message) = @_;
            print "Thread Message is $message\n";
            return (%synsetDotHash);
        }
        silly@bluetit:~/perl/threads$ thread.pl
        Number of Dots to compute: 117369
        the code took: 2 wallclock secs ( 1.92 usr 0.67 sys + 0.00 cusr 0.00 csys = 2.59 CPU) to prep
        Thread Message is test2
        Thread Message is test3
        Thread Message is test1
        Thread Message is test4
        the code took:23 wallclock secs (70.72 usr 12.29 sys + 0.00 cusr 0.00 csys = 83.01 CPU) to do threaded dots
Re^4: If I am tied to a db and I join a thread, program chrashes
by marioroy (Prior) on Feb 18, 2013 at 23:24 UTC

    Many-core Engine for Perl (MCE) comes with several examples demonstrating matrix multiplication in parallel across many cores. The readme contains benchmark results as well.