in reply to Re: If I am tied to a db and I join a thread, program crashes
in thread If I am tied to a db and I join a thread, program crashes

Perl threads will not make your program faster on a CPU with more cores. Someone recently tested it, and the Perl implementation of threads actually makes things worse on multi-core machines in most cases

As posted, that is nothing but FUD!

  1. Who tested?
  2. What did they test?
  3. How did they test?
  4. "Most cases" of what?

Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
"Too many [] have been sedated by an oppressive environment of political correctness and risk aversion."

Re^3: If I am tied to a db and I join a thread, program crashes
by jethro (Monsignor) on Jun 05, 2009 at 01:55 UTC

    The test was done by Marc Lehmann, and he showed his results at the German Perl Workshop this year. Sadly his talk is not available online, and I had to cite from memory when I answered. I have it in front of me now and can translate some points for you:

    1) Perl's interpreter threads are a sort of backport of ActiveState's emulation of fork through Windows threads. The whole Perl interpreter gets copied, with all variables. Every function internally gets an additional parameter that tells Perl where to find its variables (I guess he means for synchronisation). This makes Perl slower even if you don't use threads, makes it unstable, and doesn't work well with XS modules. There is no common address space, so you don't get any of the advantages of threads, yet you still pay the price of the synchronisation.

    2) Threads don't work well on multi-core systems because every CPU has its own MMU and cache. Because threads share resources, all MMUs and caches have to be synchronised often. For example, if a thread allocates memory, every CPU has to be halted and its state synchronised. Perl's thread implementation doesn't do that (see above), but pays with the additional indirection on every variable access, which costs 15 to 200% compared to a Perl without thread support (even when not using threads).

    3) Marc did tests with a matrix multiplication (selected because it involves a lot of variable sharing). Slowest was the version with 4 interpreter threads on a quad-core machine. The same 4 interpreter threads on a single core(!) were 20 times faster. An implementation of cooperative/non-preemptive threads (Coro, written by Marc Lehmann) on a single core was 300 times faster than the interpreter threads.

    To answer your question 4: Perl's interpreter threads seem not to work well on multi-cores in those cases where they actually make extensive use of their shared variables (that is, if Marc's results are not fake, fabricated or erroneous). You can read some of his points in the Coro documentation, if you are interested.
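    The copy-everything behaviour described in point 1 is easy to see with the stock threads and threads::shared modules. This is only an illustrative sketch (not part of Marc's benchmark): every variable is cloned into each thread unless it is explicitly marked shared.

```perl
#!/usr/bin/perl
# Sketch: each ithread gets its own copy of every variable unless that
# variable is explicitly marked shared via threads::shared.
use strict;
use warnings;
use threads;
use threads::shared;

my $private = 0;              # copied into each thread; changes stay local
my $shared : shared = 0;      # one variable visible to all threads

my @workers = map {
    threads->create(sub {
        $private++;                        # modifies this thread's own copy
        { lock($shared); $shared++; }      # synchronised update of the shared one
    });
} 1 .. 4;
$_->join for @workers;

print "private = $private\n";   # still 0 in the parent
print "shared  = $shared\n";    # 4
```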

      Firstly, thank you for your prompt and detailed response.

      Secondly, your sweeping generalisation, "Better use real processes", is incorrect--even if everything Marc Lehmann said in his talk is 100% accurate. It is (all but) impossible to parallelise matrix multiplication using "real processes" alone.

      Marc Lehmann achieves his results by using threads. Albeit a user-space implementation of cooperative threading, it is still threading. The choice for the parallelisation of Perl programs across multiple cores is not between 'using threads' and 'using processes'; it is between the ithreads implementation and the Coro implementation.

      Now we've established that a form of threading is required!

      Let's discuss the merits of the two implementations. I'm not a fan of the iThreads implementation. The attempt to emulate fork on Windows is mostly unusable, and the artifacts that emulation attempt imposes upon the use of threading are costly and frustrating. But removing them at this stage is not an option, so it is a case of working within the limitations of what we have. The same can be said about many other areas of Perl. And if you work within those limitations, iThreads are:

      1. Available out-of-the-box everywhere.

        Everywhere that hasn't explicitly chosen to eschew them, that is.

      2. A simple API. All the standard Perl functionality 'just works'.

        You don't need special implementations of: IO, select, timers, LWP, Storable et al.

      3. Very easy to use.

        For a whole raft of 'let me do something else whilst this piece of code runs' applications.

      4. Easier to develop and test than the alternatives (by a country mile!).

        This is especially true for casual multi-taskers who don't want to get into the nitty-gritty of Amdahl's Law, much less its limitations as addressed by Gustafson's Law.

        They want to be able to write the programs just as they always have for single tasking; and then run either several copies of one algorithm, or several algorithms concurrently to make use of the multiple cores that are now ubiquitous. And iThreads allows them to do that. Today, out-of-the-box with no need to learn complex specialist programming techniques to break up and interleave their algorithms into iddy-biddy chunks.

        They do not care whether they get 2x or only 1.75x performance from a dual core; or only 2.5 rather than 3x on a triple core; or only just 3x on a quad core. What they do care about is that whatever number of cores their code finds itself running on, they will get an appropriate benefit from them, without having to go through a complex tuning process for each and every cpu type.
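      For what it's worth, here is a minimal sketch of that out-of-the-box, 'do something else whilst this piece of code runs' usage. Nothing beyond the core threads module is assumed, and the workload is a made-up placeholder:

```perl
#!/usr/bin/perl
# Sketch: run a background computation in an ithread while the main
# thread carries on, then collect the result with join.
use strict;
use warnings;
use threads;

# Kick off a background computation...
my $worker = threads->create(sub {
    my $sum = 0;
    $sum += $_ for 1 .. 1_000_000;
    return $sum;
});

# ...and carry on with other work in the main thread meanwhile.
print "main thread keeps going\n";

# Collect the result when we actually need it.
my $sum = $worker->join;
print "background sum = $sum\n";
```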

      Coro only wins (perhaps[1]) on one particular class of parallelisation tasks: that of cpu-intensive algorithms running over promiscuously shared data. But this is only one class of parallelisation task, and a relatively rare one at that. And then only if the programmer is well versed in the needs and vagaries of tuning user-space cooperative threading. And that is no simple task, as anyone who used or programmed Windows 95 will tell you!

      The example given is that of matrix multiplication, and that possibly gives an indication of why Marc Lehmann's benchmark apparently shows iThreads in such a bad light. There is no need for promiscuously shared data (and the associated locking) with matrix multiplication! So if Marc's iThreads MM implementation does the naive thing of applying locks & syncing to all three arrays, then it is no wonder that it runs slowly. But that is user error!
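      To illustrate: one way (a hypothetical sketch, not Marc's code) to multiply matrices with iThreads and no shared data at all is to hand each thread its own band of rows and collect the independently computed slices via join. No threads::shared, no locks:

```perl
#!/usr/bin/perl
# Sketch: matrix multiplication with ithreads and *no* shared data.
# Each thread gets a private copy of its band of rows of A plus all of B,
# computes its slice of the product, and returns it through join.
use strict;
use warnings;
use threads;

sub mult_rows {
    my ($rows, $b) = @_;        # $rows: arrayref of rows of A; $b: all of B
    my @out;
    for my $r (@$rows) {
        my @row;
        for my $j (0 .. $#{ $b->[0] }) {
            my $sum = 0;
            $sum += $r->[$_] * $b->[$_][$j] for 0 .. $#$b;
            push @row, $sum;
        }
        push @out, \@row;
    }
    return \@out;
}

my @a = ([1, 2], [3, 4], [5, 6], [7, 8]);
my @b = ([1, 0], [0, 1]);       # identity, so A x B == A

# Split A's rows into one band per thread (2 threads here for brevity).
my @bands   = ([@a[0, 1]], [@a[2, 3]]);
my @threads = map { threads->create(\&mult_rows, $_, \@b) } @bands;
my @c       = map { @{ $_->join } } @threads;

print "@$_\n" for @c;           # prints the rows of A back
```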

      [1]: I've done a brief search of both the Coro package and the web in general, and I have been unable to locate Marc Lehmann's benchmark code (despite the fact that it is the basis of the package's primary 'claim to fame'). So I've been unable to verify my speculation about it. If the code is available anywhere, I would be happy to review it and correct my speculations if they turn out to be unfounded!

      But in the end, anyone doing serious matrix manipulations where ultimate performance is the primary requirement, probably isn't going to be using Perl! And if they are, simply dropping into PDL to do those manipulations will probably gain them far more performance than hand tuning a Coro implementation.


        After having some time to think about what I wrote yesterday, I too realized that the conclusion one can draw from the matrix multiplication example is different: "Don't parallelize communication-heavy algorithms". Still, it shows that the sharing has its costs (as you too point out). One result from Marc's tests was that the Coro implementation running on one core was 25-35% faster on a Perl built without ithread support. I.e., everyone with a standard Perl installation is paying this price, with or without using threads.

        It is (all but) impossible to parallelise matrix multiplication using "real processes" alone

        I don't think so. If you can afford 4 times the memory usage, a process implementation will be nearly 4 times faster on a quad-core. Every process uses its own non-shared memory to calculate one fourth of the rows.
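        A minimal sketch of that scheme: one child process per band of rows (one per row here, for brevity), each with its own private copy of the data, returning its results over a pipe. The tiny example matrices are made up purely for illustration.

```perl
#!/usr/bin/perl
# Sketch: parallel matrix multiplication with plain processes.
# Each forked child computes one row of the product from its own
# copy-on-write copy of A and B, and writes the row back over a pipe.
use strict;
use warnings;

my @a = ([1, 2], [3, 4]);
my @b = ([1, 0], [0, 1]);                 # identity, so A x B == A

my @kids;
for my $i (0 .. $#a) {
    pipe(my $rd, my $wr) or die "pipe: $!";
    my $pid = fork;
    die "fork: $!" unless defined $pid;
    if ($pid == 0) {                      # child: compute row $i of the product
        close $rd;
        my @row = map {
            my $j   = $_;
            my $sum = 0;
            $sum += $a[$i][$_] * $b[$_][$j] for 0 .. $#b;
            $sum;
        } 0 .. $#{ $b[0] };
        print {$wr} "@row\n";
        exit 0;
    }
    close $wr;                            # parent keeps only the read end
    push @kids, [$pid, $rd];
}

my @c;
for my $k (@kids) {
    my ($pid, $rd) = @$k;
    chomp(my $row = <$rd>);
    push @c, $row;
    waitpid $pid, 0;
}
print "$_\n" for @c;                      # rows of A come back unchanged
```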

        Marc Lehmann achieves his results by using threads. Albeit that they are a user space implementation of cooperative threading, it is still threading ...

        Now we've established that a form of threading is required!

        Note that his Coro threads don't have any parallelization potential for multi-cores. They should run exactly the same on one core as on a quad-core. He could have used a simple non-threaded matrix multiplication and it would have run exactly the same. His Coro threads have two uses as I see it: making it easier to program independent tasks in a program, and speeding up programs where I/O, user input etc. provide opportunities for parallel tasks. They don't help with multi-cores at all.

        This is the reason Marc Lehmann only used a single core to benchmark his Coro implementation. And therefore your conclusion is wrong. Threads are not required.

        To answer your question about the code, his implementation of the benchmark can be found at http://data.plan9.de/matmult. Please check it out; I will try to do the same tonight when I have some time.

      Esteemed Jethro: Well, since I am not using threads::shared, that probably explains why I do get a speed-up. BUT nothing like 300 times. I am basically doing lots of vector dot products. On my test code, running with one thread takes 10 sec; with 4 threads it takes 2 sec, but my CPU usage is only up to 50%, so I might be able to go faster.

      For matrix multiplication I don't think I could ever beat the times of Math::GSL::BLAS, which has to be about the most optimized set of routines ever. So it sort of makes sense to me that trying to speed it up with shared data and threads wouldn't work.

      Unfortunately, to get my full matrices in memory I would need more than the 8 GB available, hence my plan to break up the data, do the dot products with hashes (which gets rid of the zeros), and do them in threads. Maybe not optimal, but it should get my .cgi scripts down to an acceptable time. Hopefully.
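      A sketch of that hash-based approach to the dot products (illustrative only; the vectors and names here are made up): store only the non-zero entries of each vector keyed by index, so the dot product never touches a zero.

```perl
#!/usr/bin/perl
# Sketch: sparse dot product using hashes that hold only non-zero entries,
# keyed by index. Iterating the smaller hash keeps the work proportional
# to the number of non-zeros, not the full vector length.
use strict;
use warnings;

sub sparse_dot {
    my ($x, $y) = @_;                 # hashrefs: index => non-zero value
    # Iterate over the smaller hash and look up matches in the other.
    ($x, $y) = ($y, $x) if keys %$y < keys %$x;
    my $sum = 0;
    while (my ($i, $v) = each %$x) {
        $sum += $v * $y->{$i} if exists $y->{$i};
    }
    return $sum;
}

# Two mostly-zero vectors; only indices 3 and 7 overlap.
my %u = (3 => 2, 7 => 5, 100 => 1);
my %v = (3 => 4, 7 => 1, 999 => 9);

my $dot = sparse_dot(\%u, \%v);
print "$dot\n";                       # 2*4 + 5*1 = 13
```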

      Thanks for the Lehmann stuff, it was interesting and helpful.

        Well, as you can see from the ensuing discussion, my first post was really overdramatizing what I had heard, and I don't know as much about the topic as I should.

        Obviously the ~300x penalty for using ithreads only comes into play with heavy sharing of variables, which I was wrongly assuming to be happening with the tied hashes of MLDBM.

        Instead, in your case (with the disk-based data) there are a lot of I/O waits that other threads/processes can use to continue, and this is ideal for parallelization. So I would be very interested to know whether your speedup really comes from the quad-core. If I'm correct with my guess, your program with 4 threads would run nearly as fast, or even faster, with one core than with four (isn't there a way to turn off cores in the BIOS or somewhere?). And if that really is the case, there might be another speedup possible with Marc's Coro package.

      Many-core Engine for Perl (MCE) comes with several examples demonstrating matrix multiplication in parallel across many cores. The readme contains benchmark results as well.