in reply to Re^7: Using kernel-space threads with Perl (disorder)
in thread Using kernel-space threads with Perl
> checked your work and found several mistakes, at least one of which likely would drastically change your numbers.
Innuendo. You suggest "mistakes", but typically don't identify them so they cannot be countered.
There is one trivial mistake, -l, that has exactly NO effect on the performance.
> My investigation shows ...
Typically, no code for others to try; no indication of your testing methodology; nothing but your words in support of your claims.
If you never post anything but 'authoritative testimony', you cannot be countered. Way to go with maintaining your mystique.
> The non-linear slow-down of Thread::Queue is interesting and might be worth investigating.... I wish I had 4 cores to play with.
And right there in a nutshell is the endemic problem: Guesswork. Of course there is a non-linear slowdown.
But until you've used threads & queues in earnest a few times, the reason probably won't be obvious to you. Namely: lock-contention & context thrash.
Simplistically, the lock contention arises when one end of a given queue attempts to use it while it is currently in use by the other end. So, when one of the worker threads is running on its "own" core, reading from the queue, that queue is almost permanently locked. It only becomes unlocked for a brief period at the end of each loop cycle.
That means that when the file-reading thread comes to write a record to that queue, it will, with high probability, find that the queue is locked, and so enter a wait-state. That wait-state inevitably means that some other thread (or process) will inherit the core it was using, and so a context switch occurs. The reader will not get another timeslice until at least that thread relinquishes the core. And it is entirely possible that it won't get a look-in until several other threads have had their turn. And when it does, it may well find the queue still locked, and the cycle repeats.
Now, it should be obvious to even you how those delays could add up to produce the slow-downs I indicated. And why all current parallelisation research is looking into lock-free data structures and mechanisms.
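The contention pattern described above can be made visible directly. The following is a hypothetical sketch (not the benchmark code below; the worker count, item count, and text are made up for illustration): the producer times each enqueue() call, so the accumulated total shows roughly how long it spends blocked waiting on the queue's internal lock while the workers hold it.

```perl
#! perl -slw
# Hypothetical sketch: measure how long the producer spends blocked
# inside enqueue() while worker threads contend for the queue's lock.
use strict;
use threads;
use Thread::Queue;
use Time::HiRes qw[ time ];

our $T //= 4;                       # worker count (override with -T=n)
my $Q = Thread::Queue->new;

my @workers = map {
    threads->create( sub {
        while( defined( my $item = $Q->dequeue ) ) {
            my $copy = $item;
            $copy =~ s[a][A]g;      # token per-item work
        }
    } );
} 1 .. $T;

my $blocked = 0;
for my $i ( 1 .. 100_000 ) {
    my $t0 = time;
    $Q->enqueue( "line $i: the quick brown fox jumps over the lazy dog\n" );
    $blocked += time - $t0;         # includes any wait for the lock
}
$Q->enqueue( undef ) for 1 .. $T;   # one terminator per worker
$_->join for @workers;

printf "Producer spent %.3f seconds inside enqueue()\n", $blocked;
```

Run with different -T values and watch the producer's time inside enqueue() grow as more workers fight over the single lock.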
To demonstrate, using this trivially modified form of the code above:
```perl
#! perl -sw
use strict;
use threads;
use Thread::Queue;
use Time::HiRes qw[ time ];

sub worker {
    my $Q = shift;
    while( my $line = $Q->dequeue ) {
        $line =~ s[a][A]g;
        print $line;
    }
}

print "Started " . scalar localtime;

our $T //= 4;
my $n = $T - 1;
my @Qs = map Thread::Queue->new, 0 .. $n;
my @threads = map{ threads->create( \&worker, $Qs[ $_ ] ) } 0 .. $n;

my $start = time;
my $i = 0;
while( <> ) {
    $Qs[ $i ]->enqueue( $_ );
    $i = ( $i + 1 ) % $T;
    warn $., ' ', $. / ( time - $start ) unless $. % 1e6;
}
$Qs[ $_ ]->enqueue( undef ) for 0 .. $n;
$_->join for @threads;

print "Ended " . scalar localtime;
```
and a 1 million line file produced for the purpose:
```
C:\test>perl -e"printf qq[line %6d:the quick brown fox jumps over the lazy dog\n], $_ for 1 .. 1e6" >phrases.small

C:\test>dir phrases.small
...
23/03/2011  03:19        57,000,000 phrases.small
...
```
Now the runs using 1, 2, 3, & 4 threads and queues:
```
C:\test>junk71 -T=1 phrases.small >out.txt
Started Wed Mar 23 03:19:58 2011
1000000 19484.8213140505 at C:\test\junk71.pl line 30, <> line 1000000.
Ended Wed Mar 23 03:20:51 2011

C:\test>junk71 -T=2 phrases.small >out.txt
Started Wed Mar 23 03:20:59 2011
1000000 18001.4760827208 at C:\test\junk71.pl line 30, <> line 1000000.
Ended Wed Mar 23 03:21:55 2011

C:\test>junk71 -T=3 phrases.small >out.txt
Started Wed Mar 23 03:27:32 2011
1000000 15906.3434420465 at C:\test\junk71.pl line 30, <> line 1000000.
Ended Wed Mar 23 03:28:35 2011

C:\test>junk71 -T=4 phrases.small >out.txt
Started Wed Mar 23 03:30:32 2011
1000000 15088.645826608 at C:\test\junk71.pl line 30, <> line 1000000.
Ended Wed Mar 23 03:31:38 2011
```
And now for a single-core, sequential run:
```
C:\test>perl -MTime::HiRes=time -E"BEGIN{ warn $t=time }" -pe"s[a][A]g; }{ warn $./(time-$t); warn time" phrases.small >out.txt
1300851145.745 at -e line 1.
977517.104726817 at -e line 2, <> line 1000000.
1300851146.77491 at -e line 2, <> line 1000000.
```
And now a little math.
The best of these is 977517.104726817 / 19484.8213140505 = 50.17: 50 times slower.

The worst of these is 977517.104726817 / 15088.645826608 = 64.78: 64 times slower.
The difference between 64 and my original figures, from which you derived x120, is that these estimates are based on elapsed time, whereas my original figures were based upon actual CPU usage.
To counter your scurrilous "foobar"'d-system possibility, I've had the same tests performed (albeit with a different randomly generated 1-million-line dataset) on dual-core systems running both 32-bit Windows/Perl and Linux, and those results are broadly in line with my results above. My friend & I hope to repeat our provisional tests later this week using exactly the same datasets and a wider range of systems and setups, but based on our results to date, there is nothing to support your claimed "research".
As I have no idea what your "research" consisted of (because, of course, you chose not to tell us), I cannot speculate about what mistakes or mistaken assumptions you may have made whilst doing it; which is probably why you didn't tell us.
But, as there is only a small chance that three entirely different Perl installations--spanning compilers, OSs, hardware and continents--are "foobar"'d in ways that make them exhibit strikingly similar relative performance, I'm confident in stating that your "research" is more than just a little suspect. And your pronouncements based upon it, little more than guesswork and innuendo.
So, the next time you allude to "research" in order to support your mind's eye expertise, try making it something worthy of that epithet.
Replies are listed 'Best First'.

Re^9: (Innuendo and guesswork)
- by tye (Sage) on Mar 23, 2011 at 16:34 UTC
- by BrowserUk (Patriarch) on Mar 23, 2011 at 18:07 UTC
- by tye (Sage) on Mar 23, 2011 at 23:15 UTC
- by BrowserUk (Patriarch) on Mar 24, 2011 at 05:11 UTC
- by ikegami (Patriarch) on Mar 24, 2011 at 18:48 UTC