in reply to Re^8: (Innuendo and guesswork)
in thread Using kernel-space threads with Perl
> Innuendo. You suggest "mistakes", but typically don't identify them so they cannot be countered. There is one trivial mistake, -l, that has exactly NO effect on the performance.
The one mistake that had a huge impact on performance I did identify quite specifically. The other mistakes indeed made no impact on performance (I never said otherwise) so there was no need for them to be "countered".
> It seems that your explanation is that "lock contention" causes a non-linear "slow down" in which a huge increase in CPU usage is the result. So Thread::Queue is using spin locks? If so, somebody should certainly fix that. If not, your theory makes no sense to me.
I suspect you just failed to understand the "non-linear slow-down" I was describing and so are explaining some other thing that you have dreamed up (like contention for the disk head). I certainly don't see how restarting via exec would change the amount of lock contention.
As for not posting any code, I used your code. Of course, I couldn't reproduce your results because you made that impossible by not posting how you created your mythical 3.5GB input file. So I only did worse than you in one way: I used the code you posted but gave no details about my input file that would let the test be reproduced, while you at least posted the rough size of yours.
Right after I read your evidence that introducing Thread::Queue made processing go from "less than 2 minutes" to "~4 hours" (over 120x slower based on my stab at "math on vague numbers") I thought two things: 1) I didn't realize that Thread::Queue was so very badly implemented despite the good things people say about it; 2) Surely BrowserUk also used it badly for the performance penalty to be that ridiculous.
Regarding (1), I ran a very quick test to see how horrible Thread::Queue was and found that it wasn't horrible. So I ended up spending some time trying to guess how you had such radically different results.
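For concreteness, the kind of quick test I mean looks roughly like this (a fresh sketch, not the exact code I ran; the record size and count are arbitrary): push a pile of records through a single Thread::Queue from a producer to a consumer and time it.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use threads;
use Thread::Queue;
use Time::HiRes qw( time );

my $records = shift(@ARGV) || 100_000;
my $q       = Thread::Queue->new();

my $start    = time();
my $consumer = threads->create( sub {
    my $count = 0;
    while ( defined( my $rec = $q->dequeue() ) ) {
        $count++;                     # stand-in for "processing" the record
    }
    return $count;
} );

$q->enqueue( 'a' x 1024 ) for 1 .. $records;
$q->enqueue( undef );                 # sentinel telling the consumer to stop
my $seen = $consumer->join();

printf "moved %d records through the queue in %.2f seconds\n",
    $seen, time() - $start;
```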
Regarding (2), I saw that you were using a separate queue for each thread, so the contention for locks should be minimal. Indeed, I eventually showed that contention for locks had no significant role in the "just over 2"x CPU usage increase that I saw over and over.
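The shape I'm describing looks roughly like this (a sketch, not your posted code): the parent is the only thread that enqueues to a given queue and exactly one worker dequeues from it, so at most two threads ever compete for any one queue's lock.

```perl
use strict;
use warnings;
use threads;
use Thread::Queue;

my $workers = 4;
my @queues  = map { Thread::Queue->new() } 1 .. $workers;
my @threads = map {
    my $q = $queues[$_];
    threads->create( sub {
        while ( defined( my $line = $q->dequeue() ) ) {
            ( my $out = $line ) =~ s/a/A/g;   # the trivial "processing"
        }
    } );
} 0 .. $workers - 1;

my $i = 0;
while ( my $line = <STDIN> ) {
    $queues[ $i++ % $workers ]->enqueue($line);   # round-robin dispatch
}
$_->enqueue(undef) for @queues;                   # one sentinel per worker
$_->join() for @threads;
```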
I also saw you using a single output semaphore and using threads::shared for it. I tried replacing it with a semaphore that I knew to be efficient and also removing the semaphore. But there was no real lock contention there either so those made no real difference.
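For reference, the two output-serialization variants I'm talking about look something like this (illustrative names, not the original code): a threads::shared lock variable versus a Thread::Semaphore used as a mutex around the print.

```perl
use strict;
use warnings;
use threads;
use threads::shared;
use Thread::Semaphore;

my $out_lock :shared;                        # variant 1: shared lock variable
my $out_sem = Thread::Semaphore->new(1);     # variant 2: semaphore as a mutex

sub print_shared {
    my ($line) = @_;
    lock($out_lock);                         # released when the block exits
    print $line;
}

sub print_semaphore {
    my ($line) = @_;
    $out_sem->down();                        # acquire
    print $line;
    $out_sem->up();                          # release
}

# Either variant keeps lines from different threads from interleaving:
my @t = map {
    threads->create( sub { print_shared( "line from thread " . threads->tid() . "\n" ) } )
} 1 .. 2;
$_->join() for @t;
```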
In wasting time trying to reverse engineer your input file, I obviously used much, much smaller input files. In multiple runs on multiple systems, using different sizes of input and a couple of vastly different record sizes (always using 'a' x $N as the record, so that the processing would be somewhat less ridiculously trivial, in stark contrast to the problem being discussed), I got a lot of runs all showing exactly what I described: multiple threads, whether in the same process or not, take slightly longer elapsed time and use just over twice as much CPU to accomplish the same thing.
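Generating that sort of input is trivial; a sketch with arbitrary record count and size:

```perl
use strict;
use warnings;

my ( $file, $records, $N ) = ( 'input.txt', 1_000_000, 1024 );

open my $fh, '>', $file or die "open $file: $!";
print {$fh} 'a' x $N, "\n" for 1 .. $records;   # one all-'a' record per line
close $fh or die "close $file: $!";
```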
Yes, I fixed two significant bugs in your code so that I could verify that my tests were actually getting the same work done. But those fixes made no difference in performance and so I didn't waste time writing them up. The mere act of verifying the output made those mistakes obvious so I had no doubt you'd easily find and fix them as trivially as I did (well, you found one, it seems). The third mistake (using s/a/A/ when not using threads but using s/a/A/g when using threads) made a /huge/ difference in performance (in my attempts to reverse engineer your mysterious input file) and would also have been trivially found had any testing been done.
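To see why that third mistake dwarfs everything else on records that are nothing but 'a's, compare (illustration only):

```perl
use strict;
use warnings;

my $line = 'a' x 1024;            # a record that is nothing but 'a's

( my $once = $line ) =~ s/a/A/;   # without /g: replaces exactly one character
( my $all  = $line ) =~ s/a/A/g;  # with /g: replaces all 1024 characters

printf "without /g: %d A's; with /g: %d A's\n",
    scalar( () = $once =~ /A/g ),
    scalar( () = $all  =~ /A/g );
```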
It also makes no sense for the trivial, linear processing of a huge list of records to cause non-linear lock contention. Indeed, I posted the code I used to demonstrate that lock contention is not the source of the slow-downs I saw.
So, I had numerous iterations of bits of code, most of which were slightly modified versions of the code you posted. All of the modifications proved to have no significant impact on performance. So I saw no point in wasting time posting any of those variations. And I just followed your lead in not saying how I had generated my input files. Quite terrible of me.
For timing, I just used 'time' (from Cygwin when on Windows).
Having now seen "the quick brown fox" input, I can offer another explanation for the difference: your example of "processing" was so ridiculously small that it is no wonder that even rather minor changes in the handling of the input completely swamp it. In contrast, I was doing trivial amounts of processing, but not quite so ridiculously trivial. In quickly swapping each byte of each line, my "processing step" was not trivially small compared to the work of copying each line from the parent thread to a child thread.
The non-linear slow-down (CPU usage ramp-up, really) that I noticed was that multiplying the size of the input file by $N made the CPU used by the Thread::Queue example go up by somewhat more than $N-fold. For the simple example, there is no reason this should be the case (except perhaps trivially so, due to noise like memory fragmentation; but I was looking at the ratio of CPU used by the Thread::Queue example vs. CPU used by the fixed "perl -pe" example).
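The check is simple enough to sketch (the 'queued.pl' name and the input file names here are placeholders, not my actual harness, and the child-CPU accounting via times() assumes a Unix-ish perl): run the same worker over inputs of increasing size and watch whether the child CPU grows faster than the input does. Running the same loop over the fixed "perl -pe" version gives the ratio I was watching.

```perl
use strict;
use warnings;

# Measure the child CPU (user + sys) a command uses.  system() waits for the
# child, so its CPU shows up in the cumulative child times from times().
sub child_cpu {
    my @cmd = @_;
    my ( undef, undef, $cu0, $cs0 ) = times();
    system(@cmd) == 0 or die "'@cmd' failed: $?";
    my ( undef, undef, $cu1, $cs1 ) = times();
    return ( $cu1 - $cu0 ) + ( $cs1 - $cs0 );
}

for my $factor ( 1, 2, 4, 8 ) {
    my $input = "input_${factor}x.txt";   # pre-generated, $factor times the base size
    printf "%dx input: %.2f CPU seconds\n",
        $factor, child_cpu( $^X, 'queued.pl', $input );
}
```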
I found that I could make the Thread::Queue example use a little bit less CPU by inserting a bunch of "sync all of the threads and then exec($^X,$0)" steps throughout the process. I believe that indicates something stupid is going on underneath. I'm glad to see that this isn't so bad as to account for the discrepancies in our numbers.
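The hack I mean is roughly this (a sketch; the chunk size, input file name, and passing the resume offset as an argument are just illustration, since my actual code went through several throwaway variations):

```perl
use strict;
use warnings;
use threads;
use Thread::Queue;

my $offset = shift(@ARGV) || 0;           # where the previous incarnation stopped
open my $in, '<', 'input.txt' or die "open input.txt: $!";
seek $in, $offset, 0 or die "seek: $!";

my $q       = Thread::Queue->new();
my @workers = map {
    threads->create( sub {
        while ( defined( my $line = $q->dequeue() ) ) {
            ( my $out = $line ) =~ s/a/A/g;   # the trivial "processing"
        }
    } );
} 1 .. 2;

# Feed a fixed-size chunk of lines, then checkpoint.
for ( 1 .. 100_000 ) {
    my $line = <$in>;
    last if !defined $line;
    $q->enqueue($line);
}
$q->enqueue(undef) for @workers;          # one sentinel per worker
$_->join() for @workers;                  # sync: all work for this chunk is done

if ( !eof($in) ) {
    exec( $^X, $0, tell($in) )            # restart with a fresh interpreter
        or die "exec: $!";
}
```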
I've done a lot of work with lock contention. You seem to think it is close to a "given", something that /always/ happens when you split work between threads. That isn't the case. And when it happens, CPU usage goes /down/ (unless you are just using spin locks, of course, in which case one should just fix that).
There is work (having nothing to do with locks) in copying a record into a data structure and passing that data structure to a sister interpreter instance so it can copy that data into a buffer from its thread-specific memory allocation pool. It seems that that work adds up to a lot more than s/a/A/g on a record that is mostly not 'a's. This doesn't seem surprising and doesn't require the invention of "lock contention" to explain it.
So, since you have revealed the input you used for these experiments and I have now revealed that I used 'a'x$N ($N didn't seem to matter much but 1024 was mostly what I used), you can probably reproduce my results.
It looks like it was all a tempest in a teapot caused by doing something trivial many millions of times so that minor changes can cause dramatic results. If one is trying to get non-trivial work done, the minor micro dis-optimizations add up to ho-hum differences in the large.
- tye