Re^2: If I am tied to a db and I join a thread, program chrashes
by BrowserUk (Patriarch) on Jun 04, 2009 at 13:04 UTC
perl threads will not make your program faster on a CPU with more cores. Someone recently tested it and the perl implementation of threads is actually making it worse on multi-core machines in most cases
As posted, that is nothing but FUD!
- Who tested?
- What did they test?
- How did they test?
- "Most cases" of what?
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
The test was done by Marc Lehmann and he showed his results at the German Perl Workshop this year. Sadly his talk is not available online and I had to cite from memory when I answered. I have it before me now and can translate some points for you:
1) Perl's interpreter threads are a sort of backport of ActiveState's implementation of fork through Windows threads. The whole Perl interpreter gets copied with all variables. Every function internally gets an additional parameter that tells Perl where to find the variables (I guess he means for synchronising). This makes Perl slower even if you don't use threads, makes it unstable, and doesn't work well with XS modules. There is no common address space, so you don't get any of the advantages of threads and still have to pay the price of the synchronisation (see the sketch at the end of this reply).
2) Threads don't work well on multi-core systems because every CPU has its own MMU and cache. Because threads share resources, all MMUs and caches have to be synchronized often. For example, if a thread allocates memory, every CPU has to be halted and its state synchronized. Perl's thread implementation doesn't do that (see above), but pays with the additional indirection on every variable access, which costs 15 to 200% compared to a Perl built without thread support (even when not using threads).
3) Marc did tests with a matrix multiplication (selected because it uses a lot of variable sharing). Slowest was the version with 4 interpreter threads on a quad-core machine. The same 4 interpreter threads on a single core were 20 times faster(!). An implementation using cooperative/non-preemptive threads (Coro, written by Marc Lehmann) on a single core was 300 times faster than the interpreter threads.
To answer your question 4 now: Perl's interpreter threads seem not to work well on multi-cores in those cases where they actually make extensive use of shared variables (that is, if Marc's results are not fake, fabricated or erroneous). You can read some of his points in the Coro documentation if you are interested.
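As a small illustration of the "no common address space" part of point 1) — this is my own sketch, not code from Marc's talk — an unshared variable is simply copied into each thread, and only a threads::shared variable is visible across them:

    use strict;
    use warnings;
    use threads;
    use threads::shared;

    my $copied = 0;             # each thread gets its own private copy
    my $shared :shared = 0;     # one underlying value, visible to all threads

    my @thr = map {
        threads->create(sub {
            $copied++;                        # changes only this thread's copy
            { lock($shared); $shared++; }     # synchronised update
        });
    } 1 .. 4;
    $_->join for @thr;

    print "copied = $copied, shared = $shared\n";   # prints: copied = 0, shared = 4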
Firstly, thank you for your prompt and detailed response.
Secondly, your sweeping generalisation, "Better use real processes", is incorrect--even if everything Marc Lehmann said in his talk is 100% accurate. It is (all but) impossible to parallelise matrix multiplication using "real processes" alone.
Marc Lehmann achieves his results by using threads. Albeit a user-space implementation of cooperative threading, it is still threading. The choice for the parallelisation of Perl programs across multiple cores is not between 'using threads' and 'using processes'; it is between using the iThreads implementation or the Coro implementation.
Now we've established that a form of threading is required!
Let's discuss the merits of the two implementations. I'm not a fan of the iThreads implementation. The attempt to emulate fork on Windows is mostly unusable, and the artifacts that emulation attempt imposes upon the use of threading are costly and frustrating. But removing them at this stage is not an option, so it is a case of working within the limitations of what we have. The same can be said about many other areas of Perl. And if you work within those limitations, iThreads are:
- Available out-of-the-box everywhere.
Everywhere that hasn't explicitly chosen to eschew them, that is.
- A simple API. All the standard Perl functionality 'just works'.
You don't need special implementations of: IO, select, timers, LWP, Storable et al.
- Very easy to use.
For a whole raft of 'let me do something else whilst this piece of code runs' applications.
- Easier to develop and test than the alternatives (by a country mile!).
This is especially true for casual multi-taskers who don't want to get into the nitty-gritty of Amdahl's Law, much less its limitations as addressed by Gustafson's Law.
They want to be able to write their programs just as they always have for single tasking, and then run either several copies of one algorithm, or several algorithms concurrently, to make use of the multiple cores that are now ubiquitous. And iThreads allows them to do that today, out-of-the-box, with no need to learn complex specialist programming techniques to break their algorithms up and interleave them into itty-bitty chunks.
They do not care whether they get 2x or only 1.75x performance from a dual core; or only 2.5x rather than 3x on a triple core; or only just 3x on a quad core. What they do care about is that whatever number of cores their code finds itself running on, they will get an appropriate benefit from them, without having to go through a complex tuning process for each and every CPU type.
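That 'just works' style looks roughly like this — a minimal sketch of my own, with chunk_work() and @chunks standing in for whatever CPU-bound routine and data you actually have:

    use strict;
    use warnings;
    use threads;

    sub chunk_work {
        my ($chunk) = @_;
        # ... do something CPU-bound with $chunk ...
        return length $chunk;    # stand-in result
    }

    my @chunks  = ('alpha', 'beta', 'gamma', 'delta');

    # run one copy of the routine per chunk, then collect results via join()
    my @threads = map { threads->create(\&chunk_work, $_) } @chunks;
    my @results = map { $_->join } @threads;

    print "@results\n";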
Coro only wins (perhaps [1]) on one particular class of parallelisation tasks: that of CPU-intensive algorithms running over promiscuously shared data. But this is only one class of parallelisation task, and a relatively rare one at that. And then only if the programmer is well versed in the needs and vagaries of tuning user-space cooperative threading. And that is no simple task, as anyone who used or programmed Windows 95 will tell you!
The example given is that of matrix multiplication, and that possibly gives an indication of why Marc Lehmann's benchmark apparently shows iThreads in such a bad light. There is no need for promiscuously shared data (and the associated locking) with matrix multiplication! So if Marc's iThreads MM implementation does the naive thing of applying locks & syncing to all three arrays, then it is no wonder that it runs slowly. But that is user error!
[1]: I've done a brief search of both the Coro package and the web in general, and I have been unable to locate Marc Lehmann's benchmark code (despite it being the basis of the package's primary 'claim to fame'). So I've been unable to verify my speculation about it. If the code is available anywhere, I would be happy to review it and correct my speculations if they turn out to be unfounded!
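For what it's worth, here is a sketch of my own (not Marc's code) of a no-shared-data iThreads matrix multiply: each thread receives its own copy of the inputs, computes a band of rows of the product, and hands it back through join(), so no locking is needed at all.

    use strict;
    use warnings;
    use threads;

    sub rows_of_product {
        my ($A, $B, $from, $to) = @_;       # refs to arrays of rows
        my @rows;
        for my $i ($from .. $to) {
            for my $j (0 .. $#{ $B->[0] }) {
                my $sum = 0;
                $sum += $A->[$i][$_] * $B->[$_][$j] for 0 .. $#$B;
                $rows[$i - $from][$j] = $sum;
            }
        }
        return \@rows;                      # cloned back to the parent on join
    }

    my $size = 64;
    my @A = map { [ map { rand } 1 .. $size ] } 1 .. $size;
    my @B = map { [ map { rand } 1 .. $size ] } 1 .. $size;

    my $n_threads = 4;
    my $band      = int($size / $n_threads);
    my @workers;
    for my $t (0 .. $n_threads - 1) {
        my $from = $t * $band;
        my $to   = $t == $n_threads - 1 ? $size - 1 : $from + $band - 1;
        # each thread gets its own cloned copies of @A and @B; no locks anywhere
        push @workers, threads->create(\&rows_of_product, \@A, \@B, $from, $to);
    }
    my @C = map { @{ $_->join } } @workers;   # stitch the row bands back together
    print scalar @C, " rows computed\n";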
But in the end, anyone doing serious matrix manipulations where ultimate performance is the primary requirement, probably isn't going to be using Perl! And if they are, simply dropping into PDL to do those manipulations will probably gain them far more performance than hand tuning a Coro implementation.
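By way of illustration, a tiny sketch of the PDL route (assuming PDL is installed); the whole multiplication happens in compiled code, so there is nothing left for Perl-level threading to claw back:

    use strict;
    use warnings;
    use PDL;

    my $m1   = random(500, 500);      # two 500x500 matrices of random doubles
    my $m2   = random(500, 500);
    my $prod = $m1 x $m2;             # PDL overloads 'x' as matrix multiplication
    print $prod->info, "\n";          # e.g. "PDL: Double D [500,500]"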
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
Esteemed jethro: Well, since I am not using threads::shared, that probably explains why I do get a speed up. BUT nothing like 300 times.
I am basically doing lots of vector dot products. In my test code, running with one thread takes 10 seconds; with 4 threads it takes 2 seconds. But my CPU usage only goes up to 50%, so I might be able to go faster.
For matrix multiplication I don't think I could ever beat the times of Math::GSL::BLAS, which has to be about the most optimized set of routines ever. So it sort of makes sense to me that trying to speed it up with shared data and threads wouldn't work.
Unfortunately, to get my full matrices in memory I would need more than the 8 GB available, hence my plan to break up the data, do the dot products with hashes (which gets rid of the zeros), and do them in threads. Maybe not optimal, but it should get my .cgi scripts down to an acceptable time. Hopefully.
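Here is roughly what I mean by doing the dots with hashes — a simplified sketch, not my actual code: each vector is a hash of index => value with the zeros omitted, so the dot product only walks the indices the smaller vector actually has.

    use strict;
    use warnings;

    sub sparse_dot {
        my ($u, $v) = @_;                             # hashrefs: index => value
        ($u, $v) = ($v, $u) if keys %$v < keys %$u;   # iterate the smaller one
        my $dot = 0;
        while (my ($i, $val) = each %$u) {
            $dot += $val * $v->{$i} if exists $v->{$i};
        }
        return $dot;
    }

    my %u = (0 => 1.5, 7 => 2.0, 42 => -1.0);
    my %v = (7 => 4.0, 13 => 3.0);
    print sparse_dot(\%u, \%v), "\n";                 # 2.0 * 4.0 = 8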
Thanks for the Lehmann stuff, it was interesting and helpful.
Re^2: If I am tied to a db and I join a thread, program chrashes
by lance0r (Novice) on Jun 04, 2009 at 18:04 UTC
Thank you esteemed jethro for your insight. I did test 4 threads running the same code at the same time (obviously with no tie to a database), and while memory use went up, the total time was about 1.5 times that of a single run, not the 4 times longer I would expect without threads. CPU usage went up to 95%+ instead of the 25% used without threads.
Plus, I am back on track thanks to monk clinton, who told me to tie to the database inside the thread: one tie for each thread. It seems to work so far.
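In case it helps anyone else, here is a stripped-down sketch of what that suggestion looks like as I understand it; SDBM_File and 'mydata' are just stand-ins for whatever file and DBM you actually tie to. The point is that the tie is created (and released) inside the thread body, so no tied handle exists in the parent when threads->create clones the interpreter.

    use strict;
    use warnings;
    use threads;
    use Fcntl;         # for O_RDONLY
    use SDBM_File;

    sub worker {
        my ($key) = @_;
        tie my %db, 'SDBM_File', 'mydata', O_RDONLY, 0666
            or die "tie failed: $!";
        my $value = $db{$key};     # each thread reads through its own handle
        untie %db;                 # release this thread's handle before returning
        return $value;
    }

    my @thr = map { threads->create(\&worker, $_) } qw(foo bar baz);
    print defined $_ ? $_ : '(no value)', "\n" for map { $_->join } @thr;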
And thanks for the tip on MLDBM::Sync. I am using it without problems so far. lance
Hello Monks and lance0r,
I am facing a similar problem, but in my case I need to connect to different schemas on the same machine. I need to run Perl reports in parallel against the different schemas. Would threads cause a problem there? Initially I was planning to use the built-in fork() function, but we switched to 'use threads', and now I am getting a segmentation fault.
Can anyone please help me out here?
If your question is about DBI, read the discussion of Threads and Thread Safety. If it's not about DBI, it's even less clear to me what your question is. But in general, you will have much more luck launching separate programs than trying to use fork() or threads to run DBI things in parallel.
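For example, a bare-bones sketch of the 'separate programs' route; report.pl is a placeholder for whatever script actually does the DBI->connect and runs one report, so no database handle ever crosses a fork or a thread boundary.

    use strict;
    use warnings;

    my @schemas = qw(schema_a schema_b schema_c);    # placeholder schema names
    my @pids;

    for my $schema (@schemas) {
        my $pid = fork;
        die "fork failed: $!" unless defined $pid;
        if ($pid == 0) {
            # launch a fresh program; it makes its own DBI connection and exits
            exec 'perl', 'report.pl', $schema or die "exec failed: $!";
        }
        push @pids, $pid;                            # parent just keeps the pid
    }
    waitpid $_, 0 for @pids;                         # wait for all reports to finish
    print "all reports done\n";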