PerlMonks
Useful number of childs revisited [SOLVED]

by karlgoethebier (Abbot)
on May 08, 2015 at 10:54 UTC ( #1126078=perlquestion )

karlgoethebier has asked for the wisdom of the Perl Monks concerning the following question:

Hi all,

While playing around with Parallel::ForkManager I stumbled over Re^3: Fork vs pThreads, where BrowserUk wrote:

"...running 50 tasks concurrently on 4 cpus will take longer than running those same 50 tasks; but only 4 at any given time"

This sounds logical.

I wrote this:

#!/usr/bin/env perl
use strict;
use warnings;
use Math::BigInt;
use Parallel::ForkManager;
use Time::HiRes qw ( time );
use feature qw(say);
use Iterator::Simple qw(iter);

my @numbers   = ( 1 .. 2000 );
my $processes = shift;
my $tmp       = q(/tmp/);
my $pm        = Parallel::ForkManager->new( $processes, $tmp );
my %data      = map { $_ => undef } @numbers;

say qq(numbers: ), scalar @numbers;
say qq(processes: $processes);

$pm->run_on_finish(
    sub {
        my ( $id, $result ) = ( $_[2], $_[5] );
        $data{$id} = ${$result};
    }
);

my $start    = time;
my $iterator = iter( \@numbers );

while ( defined( my $number = $iterator->() ) ) {
    $pm->start($number) and next;
    my $factorial = Math::BigInt->bfac($number);
    $pm->finish( 0, \$factorial );
}
$pm->wait_all_children;

say qq(fork: ), time - $start;

%data  = map { $_ => undef } @numbers;
$start = time;

for my $number (@numbers) {
    my $factorial = Math::BigInt->bfac($number);
    $data{$number} = $factorial;
}

say qq(for: ), time - $start;

I think the results are quite surprising:

karl@host:~$ ./forker.pl 50
numbers: 2000
processes: 50
fork: 22.5553870201111
for: 33.7985281944275
karl@host:~$ ./forker.pl 4
numbers: 2000
processes: 4
fork: 501.324076890945
for: 33.1997199058533

I assume (and hope) there is nothing wrong with my code, but I have the vague feeling that I am making some false assumptions or jumping to some wrong conclusions.

What is my error in reasoning (if any)?

Update:

Many thanks to all for this interesting thread.

Update 2: Please see Allow for sub-second resolution #5 and Changes.

Thank you very much for any hints and best regards, Karl

«The Crux of the Biscuit is the Apostrophe»

Replies are listed 'Best First'.
Re: Useful number of childs revisited
by BrowserUk (Patriarch) on May 08, 2015 at 12:22 UTC

    Humour me and post your results from running this simplified version of your code with arg 4 (assuming you have 4 cores?) & 50:

    #!/usr/bin/env perl
    use strict;
    use warnings;
    use Math::BigInt;
    use Parallel::ForkManager;
    use Time::HiRes qw ( time );
    use feature qw(say);

    my $processes = shift;
    my $pm = Parallel::ForkManager->new( $processes );
    say qq(processes: $processes);

    my $start = time;
    for my $number ( 1 .. 50 ) {
        $pm->start( $number ) and next;
        Math::BigInt->bfac( 2000 ) for 1 .. 10;
        $pm->finish( 0 );
    }
    $pm->wait_all_children;
    say qq(fork: ), time - $start;

    With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority". I'm with torvalds on this
    In the absence of evidence, opinion is indistinguishable from prejudice. Agile (and TDD) debunked
      "Humour me..."

      My pleasure:

      karl@host:~$ ./buk.pl 4
      processes: 4
      fork: 13.0191400051117
      karl@host:~$ ./buk.pl 50
      processes: 50
      fork: 7.12985706329346

      My best regards, karl

      P.S.: British humour?


        Could you also post runs with 8, 16 & 32, please?


Re: Useful number of childs revisited
by KurtSchwind (Chaplain) on May 08, 2015 at 14:14 UTC

    Several years ago I was writing ETL for a very large data warehouse (think top 10 databases in the world, although this was over a decade ago). And I did this on Sun's E10k frames, which had multiple CPUs (this pre-dates multi-core).

    Because there are different ways of being 'bound' (memory-bound, CPU-bound, network-bound, disk-I/O-bound, etc.), the only way to truly know the optimal number of workers is by trial and error, at least with a program of any complexity. So you run tests and trials and you plot it out, and it should be pretty clear where the sweet spot is.
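    A minimal sketch of that sweep, using only core fork/waitpid so it runs without Parallel::ForkManager; the workload (`work`), the job count and the cap values are all made up for illustration:

    ```perl
    #!/usr/bin/env perl
    use strict;
    use warnings;
    use Time::HiRes qw(time);

    # Synthetic CPU-bound task standing in for the real workload.
    sub work { my $x = 0; $x += sqrt $_ for 1 .. 200_000; return $x }

    for my $cap ( 1, 2, 4, 8 ) {
        my $start  = time;
        my $active = 0;
        for my $job ( 1 .. 16 ) {
            if ( $active >= $cap ) { wait; $active-- }   # block until a slot frees
            my $pid = fork;
            die "fork failed: $!" unless defined $pid;
            if ( $pid == 0 ) { work(); exit 0 }          # child: do the work, exit
            $active++;
        }
        1 while wait != -1;                              # reap the stragglers
        printf "%2d workers: %.3fs\n", $cap, time - $start;
    }
    ```

    Plotting elapsed time against the cap should make the sweet spot visible, as described above.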

    --
    “For the Present is the point at which time touches eternity.” - CS Lewis
      "...run tests and trials... should be pretty clear where the sweet spot is"

      Yes, but what else should I do?

      But I think that you will agree that the phenomenon I described is, to put it mildly (gelinde gesagt?), surprising.

      Edit: Minor change of wording; I said this already.

      Thanks and best regards, Karl


        BTW: The simple cure for the problem would be to switch sleep for select( '', '', '', $timeout ), which would allow the timeout to be specified in fractional seconds, and set the default to something like 0.01 (without adding a dependency upon Time::HiRes).
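        The idiom being suggested here, as a self-contained sketch: four-argument select accepts a fractional timeout, so it can serve as a sub-second sleep in core Perl (Time::HiRes appears below only to measure, not to sleep):

        ```perl
        use strict;
        use warnings;
        use Time::HiRes qw(time);   # only used to measure the elapsed time

        my $start = time;
        select( undef, undef, undef, 0.05 );   # 4-arg select: sleep ~50 ms
        my $elapsed = time - $start;
        printf "slept for %.3f seconds\n", $elapsed;
        ```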


Re: Useful number of childs revisited [SOLVED]
by Laurent_R (Canon) on May 08, 2015 at 17:57 UTC
    I will talk about heavyweight OS shell processes, not threads, and certainly not Parallel::ForkManager. It doesn't even have much to do with Perl (although Perl is used in our programs). But I still believe that this is not off-topic and is actually quite relevant.

    At my job, we very often extract data from seven large databases, each database having eight sub-databases. This is not just table dumping: there is a lot of business logic in this extraction process, so the process is heavily IO-bound but also partly CPU-bound. These extraction processes are very long: from 4 to 8 hours for most, and up to 3 or 4 days for a couple of them, for the full data extraction to complete (and yes, we are extracting a very large volume of data).

    What we do is launch 7 * 8 = 56 processes through a queuing system and keep the number of active processes at a certain level; the other processes just sit pending, doing nothing, until a slot becomes free for one of them. We have 4 CPUs on our server. We found that, usually, the optimal number of processes running concurrently is somewhere between 8 and 12. (We have about 50 different data extraction applications; some are more heavily IO-bound than others, so the optimal number of processes varies to a certain extent with the nature of the extraction being run.)

    With fewer than 8 processes in parallel, the server appears to be underutilized (although we do a few other things on this server, it is essentially dedicated to these heavy extraction tasks). With more than 12 processes, the overhead of context switches appears to start slowing down overall execution (the processes themselves are not very memory-intensive, but there could be some underlying data caching, buffering and pre-fetching leading to real memory consumption higher than what we think).

    Anyway, in view of that, we usually set the queue to a maximum of about 10 processes running in parallel for our 4-CPU server.

    Je suis Charlie.
      "...so that the process is heavily IO-bound, but also in part CPU-bound."

      In that case, your process isn't either of those. It's just regular mixed processing; and in reality, as it's extracting a very large volume of data, it probably qualifies as memory-bound.

      "we usually set the queue to a maximum of about 10 processes running in parallel for our 4-CPU server"

      With a mixed-mode process, that is the sensible choice, as it allows for greater utilisation of both resources.

      • Whilst some processes are hogging the CPUs in their CPU-bound sections, other processes can still be making forward progress, because they have IO completing whilst they are not occupying a CPU.
      • And whilst some processes are waiting for IO to complete, other processes can utilise the CPUs that would otherwise stand idle waiting for that IO to complete.

      But for pure CPU-bound processing, running more processes than there are CPUs produces a net increase in overall elapsed time, because it causes more context switches and more cache misses.

      In an ideal world of 1 process per CPU, those processes will tend always to occupy the same CPUs; thus the caches, especially the L1 caches closest to the CPUs, will retain the same data across preemptions. And when preemptions occur, as there are no other processes to be run, the same processes just get another timeslice and pick up right where they left off.

      It's very rarely an ideal world on a modern OS; there are always lots of other system processes vying for a CPU; but it is still the case that many of those system processes do very little when they get a timeslot -- often just checking one or two flags or ports before relinquishing the rest of their allotment to the next process.


        "In that case, your process isn't either of those. It's just regular mixed processing; and in reality, as it's extracting a very large volume of data, it probably qualifies as memory-bound."
        BrowserUk, thank you for your comment. Yeah, I guess "regular mixed processing" is a pretty fair description of our data extraction processes, but, still, most are usually more IO-bound than anything else. A few of them are algorithmically more complex or may, for example, require a certain amount of sorting, so these might be more mixed processing and possibly also memory-bound to a certain extent.

        But I do not think that "memory-bound" is right for most of our processes. Basically (and with some simplification, because there is quite an amount of business logic involved), the typical application we run does something like this: read every active subscriber from the database subscriber table; for each of these cell phone subscribers, go into two or three dozen other tables to look for more detailed information about the billing services, network services, supplementary services, commercial segment, applicable rate plan, prepaid amounts, last bill date, next bill date, etc., for this subscriber; and, once all the relevant information about this subscriber has been collected, write it into a CSV flat file that will be used for further processing later on.

        The CSV line we write for one subscriber rarely exceeds several hundred bytes (a few thousand at most), but it still leads to a large data volume, because there are about 35 million subscribers to be extracted, so we produce files ranging from several GB to tens of GB.

        At no point in this process do most of these programs directly use a lot of memory. Very little, in fact, usually. But, as I mentioned in my previous post, the underlying database engine and the system may use quite a bit of memory for IO buffering, data caching, transaction maintenance and so on, and these are things over which we have only limited or no control.

        Having said that, there are some exceptions, and some of our programs need to load a lot of reference data into memory (at least three parameter tables associated with the call-rating engine exceed one million records), but these programs are very different and do not require parallel processing, because they don't scan the full customer database but usually reprocess files (error calls, unallocated calls, error logs) whose sizes never exceed half a GB and are usually much smaller (typically a few MB).

        Also note that I discussed only one of our regular activities on one specific platform (two servers); we do many other things on other platforms, with other applications and other OSes, but this activity is more or less the only one (that I know of) in which we really need to fine-tune a lot of parallel processing as much as we can to improve performance.


      Thank you Laurent for sharing your secret knowledge.

      "You can't say that" he said "It doesn't, and you can't, I won't, and it don't it hasn't, it isn't, it even ain't, and it shouldn't it couldn't" He told him, "No, no, no" I told him, "Yes, yes, yes" I said, "I do it all the time Ain't this boogie a mess"? (FZ)

      Anytime I put my hands on this stuff I feel a bit like this.

      And to be honest: this is one of the reasons why I have always blenched at using threads.

      Best regards, Karl


        Hi Karl,

        Hmm, I answered your post more than 24 hours ago (immediately after my answer to BrowserUk), but it seems that I forgot to actually post after previewing.

        I am not sure what I should understand by "secret knowledge", especially in view of the link.

        There is nothing secret there; I was only reporting results obtained by our team after many lengthy tests and benchmarks, nothing more. They only make sense for what our processes do, and probably not for different processes.

        As I said, I still thought that these results were not off-topic and were relevant. Sorry if they were off-topic and irrelevant.

        Having said that, I am puzzled by your answer; that's why I ask. But by no means do I find myself offended by it.

Re: Useful number of childs revisited
by RichardK (Parson) on May 08, 2015 at 13:55 UTC

    Using '/tmp' as the temp directory is badly broken on my machine. It spits out lots of errors when exiting, and tries to delete things that it doesn't own!

    So I changed it to this:

    mkdir '/tmp/1126078';
    my $tmp = q(/tmp/1126078);

    and now there's no significant difference between the two runs. But other than that, my guess is that you're just not doing enough real work, and you're mostly measuring setup/teardown time.

    BTW, I also got rid of the pointless iterator -- mainly because it's not installed here. What's wrong with for my $n (@numbers) anyway?

      "Using '/tmp' as the temp directory is badly broken"

      Thank you very much for the hint.

      "...What's wrong with for my n (@numbers)..."

      Nothing is wrong with it. But the iterator was significantly faster in the example in my OP.

      Update: Please see also this bug

      Update 2:

      "...iterator was ... faster"

      I thought this might be of interest:

      karls-mac-mini:monks karl$ ./forker.pl
      numbers: 2000
      processes: 4
      fork: 31.1171970367432
      for: 40.2419550418854

      I used this code...

      while ( defined( my $number = $iterator->() ) ) {
          $pm->start($number) and next;
          my $factorial = Math::BigInt->bfac($number);
          $pm->finish( 0, \$factorial );
      }
      $pm->wait_all_children;

      ...vs. this:

      for my $number (@numbers) {
          $pm->start($number) and next;
          my $factorial = Math::BigInt->bfac($number);
          $pm->finish( 0, \$factorial );
      }
      $pm->wait_all_children;

      Best regards, Karl


      Using '/tmp' as the temp directory is badly broken on my machine.
      Sorry to hear that. Why not take it in for repairs?

Re: Useful number of childs revisited
by sundialsvc4 (Abbot) on May 08, 2015 at 11:56 UTC

    Your results will depend entirely on exactly what it is that the threads or processes are doing.

    In this case, you seem to be calculating factorials. This is a so-called CPU-bound activity, in which every thread always consumes its full time-slice until it is pre-empted by another thread, which will dutifully consume its full time-slice, and so on. Two threads will run twice as fast as one; four threads, twice again as fast as two; but there the improvement stops, and then degrades slightly to account for the overhead of round-robin switching between threads (probably too small to see). The capacity of the only ruling constraint, the CPU, has been reached and fully utilized, and of course cannot be exceeded.

    Most real-world activities are I/O-bound, either directly, due to actual input/output that they do, or indirectly, due to virtual-memory page faults induced by trying to use (way...) too much memory. These activities depend for their execution speed on the capacity of the system to perform I/O. The threads/processes spend nearly all of their time waiting for an I/O activity: either voluntarily, for an operation that they requested, or involuntarily due to a page fault. CPU utilization is comparatively trivial.

    Trouble is, when an I/O-bound activity begins to get stoppered up, the degradation of throughput is "at first linear, then exponential." A plot of the performance curve has a nearly right-angle "elbow" to it, a point called thrashing, or "hitting the wall" (with a grisly and final "thud"). (Example: "6-at-a-time = 4 minutes; 12-at-a-time = 9 hours." A bit extreme, yes, but long ago I saw it happen.)

    To avoid this, the best approach is to do what's done in any fast-food restaurant: maintain a manageable number of workers, each of which processes work from a thread-safe queue, so that no matter how much work there is to do, the work in process can be limited and adjusted. (The waiting line just gets longer, but the transactions/second remains stable.) There are plenty of workload-management packages on CPAN to do this.
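    The fixed-crew-plus-queue pattern can be sketched with core threads and Thread::Queue, assuming a threads-enabled perl; the squaring of job numbers is a stand-in for real work:

    ```perl
    use strict;
    use warnings;
    use threads;
    use Thread::Queue;

    my $jobs    = Thread::Queue->new;
    my $results = Thread::Queue->new;

    # A fixed crew of four workers, each pulling jobs from the shared queue.
    my @workers = map {
        threads->create( sub {
            while ( defined( my $n = $jobs->dequeue ) ) {
                $results->enqueue( $n * $n );   # stand-in for the real work
            }
        } );
    } 1 .. 4;

    $jobs->enqueue($_) for 1 .. 100;   # the waiting line may grow freely
    $jobs->end;                        # no more work; dequeue now returns undef
    $_->join for @workers;

    my $sum = 0;
    while ( defined( my $r = $results->dequeue_nb ) ) { $sum += $r }
    print "sum of squares: $sum\n";
    ```

    However long the queue gets, only four jobs are ever in flight at once, which is exactly the throughput-stabilising property described above.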

      "...do what’s done in any fast-food restaurant"

      Barfing?


Node Type: perlquestion [id://1126078]
Approved by hippo
Front-paged by Arunbear