in reply to Re^9: (Innuendo and guesswork)
in thread Using kernel-space threads with Perl

Yes, I fixed two significant bugs in your code

That is an outright lie. There simply aren't "two significant bugs" in that code. Neither of those trivia you've identified make the slightest bit of difference to the reproducible data posted.

And if you don't know Threads::Shared doesn't use spin-locks, you've made my point.

Regarding (2), I saw that you were using a separate queue for each thread, so the contention for locks should be minimal. Indeed, I eventually showed that contention for locks had no significant role in the "just over 2"x CPU usage increase that I saw over and over.

Oh dear. You showed nothing. You stated such, but it was either based on guesswork or bad testing.

This is two threads communicating through a queue as fast as it is possible for them to run:

#! perl -slw use strict; use threads; use threads::shared; use Thread::Queue; use Time::HiRes qw[ time sleep ]; my $dqd :shared = 0; my $nqd :shared = 0; my $Q = new Thread::Queue; my $start = time; async { while( $Q->dequeue ) { ++$dqd; } }->detach; async { for ( 1 .. 1e6 ) { $Q->enqueue( $_ ); ++$nqd; } }->detach; while( sleep 0.1 and $dqd < 1e6 ) { my $e = time - $start; printf STDERR "\rn;%7d (%.3f/s) d:%7d (%.3f/s)", $nqd, $nqd / $e, $dqd, $dqd / $e; }
C:\test>junk73 n;1000000 (44406.955/s) d: 994850 (44178.215/s) Elapsed: 22.619 seconds

If you run that, you might be lucky to see a throughput of 45,000 lines/second maximum. And remember there is no IO going on there at all. Add that back and you'll get a much lower throughput rate. (As demonstrated above!)

But reading from and writing to disk:

C:\test>perl -MTime::HiRes=time -e"BEGIN{$t=time()}" -nE"say;}{ warn t +ime() -$t" phrases.small >out.txt 0.944000005722046 at -e line 2, <> line 1000000.

Somewhat over 1,000,000 lines/second. You might achieve more if you have a faster disk.

1e6 / 45e3 is 20 times slower. And that's absolute minimum.

And if we drop the disk head contention to make it comparible with the above:

C:\test>perl -MTime::HiRes=time -e"BEGIN{$t=time()}" -nE"}{ warn time( +) -$t" phrases.small 0.388376951217651 at -e line 2, <> line 1000000.

Then we get a throughput of 2.57 million lines/second giving 2.57e6 / 45e3 = 50x slower.

So, once again I say. I don't know how you came up with your "2x slower", but it is just plain wrong.

For the rest, it is clear to me that you arguing on the basis of guesswork, and it simply isn't worth my time to bother to argue with you further.

Since you won't post code to back up your original "solutions" to the OP, he'll have to guess how those magical solutions are meant to work, or use real code he can run for himself; and real timings he (and anyone; 'cept maybe you) can verify for themselves.


Update: As promised, here are the comparible figures using the exact same file as I originally used (Not the quick brown fox example above), as run on dual core systems under Windows and linux by a third party:
Firstly, 32-bit perl 5.12.3 on Windows: C:\_32\pscrpt>perl buk.pl -T=1 phrases.small >out.txt Started Thu Mar 24 04:52:24 2011 1000000 14225.3834185375 at buk.pl line 30, <> line 1000000. Ended Thu Mar 24 04:53:40 2011 C:\_32\pscrpt>perl buk.pl -T=2 phrases.small >out.txt Started Thu Mar 24 04:54:03 2011 1000000 12605.8696080362 at buk.pl line 30, <> line 1000000. Ended Thu Mar 24 04:55:22 2011 C:\_32\pscrpt>perl buk.pl -T=3 phrases.small >out.txt Started Thu Mar 24 04:55:54 2011 1000000 12348.060968551 at buk.pl line 30, <> line 1000000. Ended Thu Mar 24 04:57:15 2011 C:\_32\pscrpt>perl buk.pl -T=4 phrases.small >out.txt Started Thu Mar 24 04:57:43 2011 1000000 12225.405921681 at buk.pl line 30, <> line 1000000. Ended Thu Mar 24 04:59:05 2011 And, with 32-bit perl-5.10.something on linux: [rob@localhost pscrpt]$ /home/rob/perl510-multi/bin/perl buk.pl -T=1 p +hrases.small >out.txt Started Thu Mar 24 05:03:23 2011 1000000 8861.7641709435231 at buk.pl line 30, <> line 1000000. Ended Thu Mar 24 05:05:23 2011 [rob@localhost pscrpt]$ /home/rob/perl510-multi/bin/perl buk.pl -T=2 p +hrases.small >out.txt Started Thu Mar 24 05:05:47 2011 1000000 5387.58673401599584 at buk.pl line 30, <> line 1000000. Ended Thu Mar 24 05:08:53 2011 [rob@localhost pscrpt]$ /home/rob/perl510-multi/bin/perl buk.pl -T=3 p +hrases.small >out.txt Started Thu Mar 24 05:09:11 2011 1000000 4805.8109482304384 at buk.pl line 30, <> line 1000000. Ended Thu Mar 24 05:12:40 2011 [rob@localhost pscrpt]$ /home/rob/perl510-multi/bin/perl buk.pl -T=4 p +hrases.small >out.txt Started Thu Mar 24 05:14:26 2011 1000000 4558.58825262797919 at buk.pl line 30, <> line 1000000. Ended Thu Mar 24 05:18:05 2011

Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.

Replies are listed 'Best First'.
Re^11: (Innuendo and guesswork)
by tye (Sage) on Mar 23, 2011 at 23:15 UTC
    Yes, I fixed two significant bugs in your code
    That is an outright lie. There simply aren't "two significant bugs" in that code. Neither of those trivia you've identified make the slightest bit of difference to the reproducible data posted.

    1) Since I can't reproduce any of the "data" posted in the node with the bugs (since I still don't know what your input file was, other than its size), I wonder what "reproducible data posted" you are talking about.

    2) The "trivia" of leaving off /g had a huge impact on performance for me. If I could reproduce your performance data, then I could determine by how much the lack of /g changed it for your secret input file. After making my own input file, the lack of /g made a 10x change in performance (for your code).

    3) I was repeatedly quite clear that the two other bugs did not significantly impact performance (which is the data you posted). They were significant in that they caused incorrect output to be produced which was not acceptable to me since I didn't want to do a bunch of benchmarks and then notice that I wasn't actually doing the same work (like you did).

    I don't know how you came up with your "2x slower" [....] Since you won't post code

    I used your code. The only code I didn't fully post was trivia such as for generating the input and gathering the timings. If you can't figure that out without me spoon-feeding it to you, then here:

    time perl -le'print "a"x1024 for 1..100_000' > 1kLines.txt time perl -pe's/a/A/g' 1kLines.txt > output.cmp time perl bukQueue.pl 1kLines.txt > output.tmp diff -sq output.cmp output.tmp
    and real timings he (and anyone; 'cept maybe you) can verify for themselves.

    Oh, I didn't realize you had distributed your input file to the OP (and everyone except me). The change in performance is drastically impacted by the input used. For the stupidest-possible benchmark, you can use this instead:

    time perl -le'print "z"x1024 for 1..100_000' > 1kLines.txt

    which gives me 10x CPU consumption difference (still nowhere near your original "reproducible data" of "less than 2 minutes" vs "~4 hours").

    The more realistic but still silly comparison gave me results similar to:

    time perl -le'print "a"x1024 for 1..100_000' > 1kLines.txt real 0m00.685s user 0m00.228s sys 0m00.460s cpu 0m00.688s time perl -pe's/a/A/g' 1kLines.txt > output.cmp real 0m44.254s user 0m43.323s sys 0m00.924s cpu 0m44.247s rm -f output.tmp time perl tyeQueue.pl 1kLines.txt real 0m52.719s user 1m32.618s sys 0m02.692s cpu 1m35.310s diff -sq output.cmp output.tmp Files output.cmp and output.tmp are identical time perl bukQueue.pl 1kLines.txt > output.tmp real 0m49.997s user 1m35.598s sys 0m01.696s cpu 1m37.294s diff -sq output.cmp output.tmp Files output.cmp and output.tmp differ

    CPU consumption goes from just under 45 seconds to a bit over 90 seconds. 2.2x more CPU used.

    Feel free to scale up to 3.5GB, change the line contents, and reproduce your "2 minutes vs 4 hours" data but this time in a way that anyone with hours to waste on rather meaningless benchmarking can actually reproduce.

    When you do that, also show some benchmarks for your chosen line contents but a small fraction of the 3.5GB so people with less copious time to waste can chip in. And then also explain your guess as to why the smaller file gives 2x..10x CPU consumption difference while your 3.5GB file produces a 120x difference (or whatever it is).

    Since you keep emphasizing "120x" as being "my" number, perhaps you could also tell me what number you come up with for "<2 minutes vs ~4 hours". You said "my original figures are based upon actual cpu usage". The first performance figure I see from you is "That will take ~4 hours to process a 3.5 GB file". That sounds like elapsed time to me. So I bet the "120x" is not accurate. It was convenient short-hand and should be in the general ballpark based on "Neither of those trivia you've identified make the slightest bit of difference to the reproducible data posted".

    It'd be nice if rather than just hinting that "120x" is an unreasonable interpretation that you would actually provide what the clear performance differences are for your original runs (the "<2 minutes" and "~4 hours" runs). If it was actually "~100x CPU usage" or "~50x elapse time (using two cores)" or whatever, just state it clearly.

    And, yes, I haven't posted the code for tyeQueue.pl. As you can see, the performance differences compared to bukQueue.pl are trivial. The output from it is correct, however. I played around with a bunch of things in tyeQueue.pl and they made no difference in performance. So there is no point getting distracted with one particular version of that trivia.

    - tye        

      Finally! Some real data to work with.

      Just so we can compare like with like, here are your data & tests run on my machine:

      So, even with your suspiciously unrealistic choice of dataset, using 1 worker thread requires 15% more cpu & with 4, 26% more. (And that's the original point, but I'll come back to that.)

      But replacing every character of every line is ... let's just say extremely unrealistic for now. So let's see what happens when we use what is only a very slightly more realistic dataset with a 50% duty cycle:

      Now the differences are rather more obvious. Ranging from 36% more with 1 worker, and 93% more for 4 workers.

      And how about the length of those lines? I'd bet that of the data stored in text files around the world the vast majority of it has line lengths far shorter than 1K each. Even really huge datasets with really huge 'records' tend to wrap them at some terminal-friendly limit. Eg DNA & FASTA files.

      So let's do something about that also. Cut the line length to a more reasonable 100 bytes and increase the number of lines to 1e6 to maintain approximate parity with the dataset size:

      So, now with a dataset somewhat more likely to reflect the norm, we've got 1 worker taking 422% and 4 workers taking 1155% more. That's 4 to 11 times as long. Still a far cry from "my 120x" I hear you cry, but bear with me.

      When I came to appraise the possibilities I looked for an existing file of roughly the right size and found a file I'd generated for some similar purpose. 5GB of random 'phrases' in typical 'text file sized' lines. Perfect!:

      C:\test>dir phrases.txt 28/12/2010 00:09 5,033,440,760 phrases.txt C:\test>head phrases.txt hexagonal monstrance fronded repand trouped modelers fragged foresighted nescient epilogue athwart venging brickyards jeweled manometer telium telegony apparitors imperfects deontic impuni +ty totemists lire summered nighthawks leucocytes bronchodilators fissionable collapses luminances masquer notated giaour overhanded outtraveled moneys dourines undauntedly cockaded sub +ventions interrogatives ya prereform tattooing tablemate thunderclouds numbats preprogram pulque palanquin hobbled locoing faced overshot nickeled minimum pronunciations pressingly eagl +ing lovages fatalities churnings barned treys bombproof refillable C:\test>perl -pe"s/a/A/g}{warn join' ',$., times" phrases.txt > output +.cmp C:\test>perl -pe"s/a/A/g}{warn join' ',$., times" phrases.txt > nul 100000000 159.542 7.878 0 0 at -e line 1, <> line 100000000. C:\test>junk71 -T=1 phrases.txt > output.cmp Started Thu Mar 24 04:26:07 2011 1000000 28259.7636234697 at C:\test\junk71.pl line 30, <> line 1000000 +. 2000000 28300.5518302086 at C:\test\junk71.pl line 30, <> line 2000000 +. 3000000 28301.3527797113 at C:\test\junk71.pl line 30, <> line 3000000 +. 4000000 28299.3505386885 at C:\test\junk71.pl line 30, <> line 4000000 +. 5000000 28287.2627896293 at C:\test\junk71.pl line 30, <> line 5000000 +. Terminating on signal SIGINT(2) C:\test>junk71 -T=4 phrases.txt > output.cmp Started Thu Mar 24 04:32:15 2011 1000000 14366.7839914658 at C:\test\junk71.pl line 30, <> line 1000000 +. 2000000 14388.0751686352 at C:\test\junk71.pl line 30, <> line 2000000 +. 3000000 14387.2471354457 at C:\test\junk71.pl line 30, <> line 3000000 +. 4000000 14366.3711996768 at C:\test\junk71.pl line 30, <> line 4000000 +. 5000000 14356.1826309399 at C:\test\junk71.pl line 30, <> line 5000000 +. Terminating on signal SIGINT(2)

      And so I saw the one-liner complete in (well) under 2 minutes. And from the very consistent lines/sec output, I projected (100e6 / 14387) = 6950 seconds or 115 minutes. If you (care to) remember that my original post contained no timing code, so the timing was done rather more crudely. Hence my factor of two exaggeration.

      But, and here is the real point. Never in any of these tests, yours or mine, your dataset or mine, have any of us found a threads & queues solution that achieves a performance gain. And anything less than faster is a failure. Simply not worth the effort.

      Not once. Not even your highly, (some might even say suspiciously), unrealistic dataset and duty cycle. Indeed, the more threads and queues you throw at the problem, the longer it damn well takes.

      I'll let others decide for themselves about who followed a more realistic test scenario in the absence of any specifics about the OPs actual data and processing--which I asked for but never received.

      Bottom line: A threads & queues solution will never solve the OPs problem. And I seriously doubt, but don't have the expertise to know, nor your code to try, that a forks & pipes solution will achieve much either. Assuming realistic datasets and duty cycles.

      And the whole point of my posting some timing info was to force you to try and counter it. So the next time you get tempted to answer the OPs question with an untried, mind's eye solution:

      Create the threads first and then have each thread load just the data it needs (and don't share it, of course). Then there won't be extra copies of that stuff created.

      And only when pressed for more detail, did you 'fine tuning' that piece of off-the-cuff, based-on-nothing-but-what-you'd-read-somewhere, nonsense with more guesswork:

      Having the parent read in the data and hand off each piece to the appropriate thread(s) (I'm guessing via Thread::Queue might be a good way) is the most general method that springs to my mind. I'd probably do something similar except using processes and simple pipes, as I've often done.

      If you have neither the experience to know, nor the time to test your theories a little first, maybe you'll think twice, and so avoid wasting the OPs time pursuing impossible 'solutions'.

      That's me done. Last word is yours.


      Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
      "Science is about questioning the status quo. Questioning authority".
      In the absence of evidence, opinion is indistinguishable from prejudice.

        But replacing every character of every line is ... let's just say extremely unrealistic for now. So let's see what happens when we use what is only a very slightly more realistic dataset with a 50% duty cycle:

        The work units are indeed quite small (1.0ms). But why then did you go and make the work units even smaller (0.5ms)?

        But, and here is the real point. Never in any of these tests, yours or mine, your dataset or mine, have any of us found a threads & queues solution that achieves a performance gain. And anything less than faster is a failure. Simply not worth the effort.

        Am I misreading your results? I see

        • Not threaded = 23s
        • T=1 = 25s
        • T=2 = 13s
        • T=3 = 10s
        • T=4 = 9s

        Huge gains except for a small penalty for T=1.