Re: Parallel::ForkManager and wait_all_children

by rgren925 (Beadle)
on May 13, 2015 at 23:02 UTC


in reply to Parallel::ForkManager and wait_all_children

Thanks all for the replies.

I've tried to synthesize the different approaches.

To flesh this out a bit, I'm planning on using callbacks (run_on_wait) to manage notifying/killing a hung process. Using alarm(TIMEOUT) doesn't really solve my problem. If I set the timeout to 60, no other processes will run until that 60 seconds has elapsed as the wait_all_children still isn't satisfied.

wait_for_available_procs (which was newer than my version of Parallel::ForkManager--so I upgraded) didn't seem to make any difference.

The callbacks indicate that everything stalls until the looping test4.sh script is killed.

use strict;
use warnings;
use Parallel::ForkManager;

use constant TIMEOUT => 60;

my @runArray = ("test1.sh", "test2.sh", "test3.sh", "test4.sh", "test5.sh");
my ($pid, $exitCode, $ident);

my $forkMgr = Parallel::ForkManager->new(3);

$forkMgr->run_on_start( sub {
    ($pid, $ident) = @_;
    print "Started ==> $ident\n";
} );

$forkMgr->run_on_finish( sub {
    ($pid, $exitCode, $ident) = @_;
    print "Ended ==> $ident\n";
} );

while (1) {
    for my $runCommand (@runArray) {
        $forkMgr->start($runCommand) and next;
        alarm(TIMEOUT);
        system("/usr/localcw/opt/patrol/nagios/libexec/$runCommand") or die ("exec: $!\n");
    }
    $forkMgr->wait_all_children;
    sleep 10;
}
exit;

Replies are listed 'Best First'.
Re^2: Parallel::ForkManager and wait_all_children
by ikegami (Patriarch) on May 14, 2015 at 14:34 UTC

    If I set the timeout to 60, no other processes will run until that 60 seconds has elapsed as the wait_all_children still isn't satisfied.

    Only in the rare instances when it hangs, and only because it takes that long for my method to detect that a process has become hung. That's as good as it gets without inside knowledge of the tests being run. If you know more about the tests being run (especially if you have the power to change them), then a much more responsive solution can be created.


    wait_for_available_procs (which was newer than my version of Parallel::ForkManager--so I upgraded) didn't seem to make any difference.

    wait_for_available_procs(3) won't make a difference.

    wait_for_available_procs(1) will make a difference, but it introduces a bug and merely postpones the problem.


    You introduced some major bugs in the code. Check your process list when it runs.

    1. Call finish after system

    2. You're killing the wrong process. You're not killing the child that's running the test, so you're going to end up with lots of hung processes running. I already showed how to send the signal to the right process, and I already showed a much, much simpler solution.

      I am now using your code posted above (using open3 instead of exec/system, etc.):
      use strict;
      use warnings;
      use Parallel::ForkManager;
      use IPC::Open3 qw( open3 );
      use POSIX qw( WNOHANG );

      use constant TIMEOUT => 120;

      my @runArray = ("test1.sh", "test2.sh", "test3.sh", "test4.sh", "test5.sh");
      my ($pid, $exitCode, $ident);
      my $currentTime;

      my $forkMgr = Parallel::ForkManager->new(3);

      $forkMgr->run_on_start( sub {
          ($pid, $ident) = @_;
          print "$currentTime Started ==> $ident\n";
      } );

      $forkMgr->run_on_finish( sub {
          ($pid, $exitCode, $ident) = @_;
          print "$currentTime Ended ==> $ident\n";
      } );

      while (1) {
          $currentTime = localtime();
          for my $runCommand (@runArray) {
              $forkMgr->start($runCommand) and next;
              my $pid = open3('<&STDIN', '>&STDOUT', '>&STDERR',
                  "/usr/localcw/opt/patrol/nagios/libexec/$runCommand");
              wait_for_test_to_end($pid);
              $forkMgr->finish($? & 0x7F ? 0x80 | ($? & 0x7F) : $? >> 8);
          }
          $forkMgr->wait_all_children;
          sleep 10;
      }
      exit;

      sub wait_for_test_to_end {
          my ($pid) = @_;

          my $abs_timeout = time() + TIMEOUT;
          while (1) {
              return if waitpid($pid, WNOHANG) > 0;
              last if time() > $abs_timeout;
              sleep(1);
          }

          kill(ALRM => $pid);

          $abs_timeout = time() + 15;
          while (1) {
              return if waitpid($pid, WNOHANG) > 0;
              last if time() > $abs_timeout;
              sleep(1);
          }

          kill(KILL => $pid);
          waitpid($pid, 0);
      }
      Still same behavior.
      The looping test4.sh is still hanging everything up.
      Here's some trace output. The "Started/Ended" statements are coming from the callbacks and the "I am running..." are coming from the test1-5.sh scripts.
      Thu May 14 16:52:40 2015 Started ==> test1.sh
      Thu May 14 16:52:40 2015 Started ==> test2.sh
      Thu May 14 16:52:41 CDT 2015 I am running test1.sh
      Thu May 14 16:52:41 CDT 2015 I am running test2.sh
      Thu May 14 16:52:41 CDT 2015 I am running test3.sh
      Thu May 14 16:52:40 2015 Started ==> test3.sh
      Thu May 14 16:52:40 2015 Ended ==> test3.sh
      Thu May 14 16:52:40 2015 Ended ==> test2.sh
      Thu May 14 16:52:40 2015 Ended ==> test1.sh
      Thu May 14 16:52:40 2015 Started ==> test4.sh
      Thu May 14 16:52:43 CDT 2015 I am running test4.sh
      Thu May 14 16:52:43 CDT 2015 I am running test5.sh
      Thu May 14 16:52:53 CDT 2015 I am running test4.sh
      Thu May 14 16:53:03 CDT 2015 I am running test4.sh
      Thu May 14 16:53:13 CDT 2015 I am running test4.sh
      Thu May 14 16:53:23 CDT 2015 I am running test4.sh
      Thu May 14 16:53:33 CDT 2015 I am running test4.sh
      Thu May 14 16:53:43 CDT 2015 I am running test4.sh
      Thu May 14 16:53:53 CDT 2015 I am running test4.sh
      Thu May 14 16:54:03 CDT 2015 I am running test4.sh
      Thu May 14 16:54:13 CDT 2015 I am running test4.sh
      Thu May 14 16:54:23 CDT 2015 I am running test4.sh
      Thu May 14 16:54:33 CDT 2015 I am running test4.sh
      Thu May 14 16:54:43 CDT 2015 I am running test4.sh
      Thu May 14 16:52:40 2015 Started ==> test5.sh
      Thu May 14 16:52:40 2015 Ended ==> test5.sh
      Thu May 14 16:52:40 2015 Ended ==> test4.sh
      Thu May 14 16:54:56 2015 Started ==> test1.sh
      Thu May 14 16:54:56 2015 Started ==> test2.sh
      Thu May 14 16:54:56 CDT 2015 I am running test1.sh
      Thu May 14 16:54:56 CDT 2015 I am running test2.sh
      Thu May 14 16:54:56 CDT 2015 I am running test3.sh
      Thu May 14 16:54:56 2015 Started ==> test3.sh
      Thu May 14 16:54:56 2015 Ended ==> test2.sh
      Thu May 14 16:54:56 2015 Ended ==> test1.sh
      Thu May 14 16:54:56 2015 Ended ==> test3.sh
      Thu May 14 16:54:56 2015 Started ==> test4.sh
      Thu May 14 16:54:58 CDT 2015 I am running test4.sh
      Thu May 14 16:54:58 CDT 2015 I am running test5.sh
      Thu May 14 16:55:08 CDT 2015 I am running test4.sh
      Thu May 14 16:55:18 CDT 2015 I am running test4.sh
      etc.
      During the 2-minute timeout wait to kill test4.sh, nothing else is happening (not even the run_on_start/finish for test5.sh). I still have 2 forkable processes (of the defined 3) that are not being used, I believe, because Parallel::ForkManager is waiting for all the children to be done. I recognize that one process will be tied up for the timeout value, but I need the other two to continue processing available work (test1-3.sh and test5.sh). I'll take care to ensure test4 doesn't run again while one is already running (using a hash of running jobs managed by the callbacks).

      That is the crux of my problem.
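The "hash of running jobs" idea described above can be sketched without Parallel::ForkManager, using only fork and waitpid from core Perl. This is a rough illustration under stated assumptions, not the code from this thread: the job names, the durations in %duration, and the two-pass loop are hypothetical stand-ins for the real test scripts and the endless while loop. A job that is still running is skipped on the next pass while the remaining slots keep taking work.

```perl
use strict;
use warnings;
use POSIX qw( WNOHANG );

use constant MAX_PROCS => 3;

# Hypothetical stand-ins for the test scripts: name => seconds the job takes.
my %duration = (job1 => 0, job2 => 0, job3 => 0, job4 => 2, job5 => 0);

my %name_of;    # pid => job name, for every child not yet reaped
my %is_active;  # job name => 1 while an instance of that job is running
my @events;     # what the parent saw, in the order it saw it

# Collect finished children. If $block is true, wait for at least one first.
sub reap {
    my ($block) = @_;
    while ((my $pid = waitpid(-1, $block ? 0 : WNOHANG)) > 0) {
        $block = 0;                            # after one blocking wait, only poll
        my $name = delete($name_of{$pid}) // next;
        delete $is_active{$name};
        push @events, "Ended $name";
    }
}

for my $pass (1 .. 2) {
    for my $name (sort keys %duration) {
        reap(0);
        if ($is_active{$name}) {               # previous instance still running
            push @events, "Skipped $name";
            next;
        }
        reap(1) while keys(%name_of) >= MAX_PROCS;   # wait for a free slot

        my $pid = fork() // die "fork: $!";
        if ($pid == 0) {
            sleep $duration{$name};            # pretend to be the test script
            POSIX::_exit(0);                   # leave without running END blocks
        }
        $name_of{$pid}    = $name;
        $is_active{$name} = 1;
        push @events, "Started $name";
    }
}
reap(1) while %name_of;                        # drain whatever is still running

print "$_\n" for @events;
```

In this sketch the slow job4 is started once, skipped on the second pass, and only collected in the final drain, while the quick jobs run on both passes.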

        During the 2-minute timeout wait to kill test4.sh, nothing else is happening (not even the run_on_start/finish for test5.sh). I still have 2 forkable processes (of the defined 3) that are not being used

        That's completely false!!!

        on_start for test5 runs long before test4 finishes. Check the timestamps; your output is out of order. (You probably redirected the output, which caused STDOUT to become block buffered. Use $| = 1; to disable buffering on STDOUT.)
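The buffering effect mentioned above can be reproduced in a small, self-contained sketch. Assumptions: a temporary file stands in for the redirected STDOUT, the child writes with syswrite to mimic an external script writing on its own schedule, and run_demo is a hypothetical helper name. Calling autoflush(1) on a handle has the same effect as selecting that handle and setting $| = 1.

```perl
use strict;
use warnings;
use IO::Handle;
use File::Temp qw( tempfile );
use POSIX ();

# Write a "Started" line, let a child write its own line, then write "Ended".
# Returns the lines in the order they actually landed in the file.
sub run_demo {
    my ($autoflush) = @_;
    my ($fh, $path) = tempfile(UNLINK => 1);   # stands in for redirected STDOUT
    $fh->autoflush(1) if $autoflush;

    print {$fh} "Started ==> test\n";          # stays in the buffer unless autoflush is on

    my $pid = fork() // die "fork: $!";
    if ($pid == 0) {
        syswrite($fh, "I am running test\n");  # unbuffered, like a separate program
        POSIX::_exit(0);                       # do not flush the inherited buffer
    }
    waitpid($pid, 0);

    print {$fh} "Ended ==> test\n";
    close($fh) or die "close: $!";

    open(my $in, '<', $path) or die "open: $!";
    chomp(my @lines = <$in>);
    return @lines;
}

my @buffered = run_demo(0);
my @flushed  = run_demo(1);

print "without autoflush:\n", map { "  $_\n" } @buffered;
print "with autoflush:\n",    map { "  $_\n" } @flushed;
```

Without autoflush, the parent's "Started" line lands in the file after the child's line even though it was printed first; with autoflush the file matches the real order of events.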

        Thu May 14 16:52:40 2015 Started ==> test4.sh
        Thu May 14 16:52:40 2015 Started ==> test5.sh
        Thu May 14 16:52:43 CDT 2015 I am running test4.sh
        Thu May 14 16:52:43 CDT 2015 I am running test5.sh
        [If you were to print out when test5 ends, it would be here]
        Thu May 14 16:52:53 CDT 2015 I am running test4.sh
        Thu May 14 16:53:03 CDT 2015 I am running test4.sh
        Thu May 14 16:53:13 CDT 2015 I am running test4.sh
        Thu May 14 16:53:23 CDT 2015 I am running test4.sh
        Thu May 14 16:53:33 CDT 2015 I am running test4.sh
        Thu May 14 16:53:43 CDT 2015 I am running test4.sh
        Thu May 14 16:53:53 CDT 2015 I am running test4.sh
        Thu May 14 16:54:03 CDT 2015 I am running test4.sh
        Thu May 14 16:54:13 CDT 2015 I am running test4.sh
        Thu May 14 16:54:23 CDT 2015 I am running test4.sh
        Thu May 14 16:54:33 CDT 2015 I am running test4.sh
        Thu May 14 16:54:43 CDT 2015 I am running test4.sh

        When test5 is running, there's only one unused process, and that's because there's nothing for that process to do. Only when test5 finishes do you have two unused processes, and that's because there's nothing for those two processes to do (since you want all processes to end before pausing before running the test suite again).

        You are right about on_finish, though. on_finish code is not guaranteed to run immediately when the child ends. It will run when the parent reaps the child, which will happen at some point before wait_all_children returns.
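The point about reaping can be illustrated with plain fork and waitpid. This is a minimal sketch rather than Parallel::ForkManager's internals: the exit code 3 and the 2-second pause are arbitrary values chosen for the demonstration. The child ends at once, but the parent learns nothing about it until it calls waitpid, which is the moment a finish callback would have a chance to run. The status decoding is the same expression passed to finish earlier in the thread.

```perl
use strict;
use warnings;
use POSIX ();

my $pid = fork() // die "fork: $!";
if ($pid == 0) {
    POSIX::_exit(3);              # child ends immediately with an arbitrary code
}

my $forked_at = time();
sleep 2;                          # parent is busy; the child is already a zombie

my $reaped = waitpid($pid, 0);    # only now does the parent collect the child
my $status = $?;
my $delay  = time() - $forked_at;

# Low 7 bits hold a terminating signal, if any; otherwise the high byte
# holds the exit code.
my $exit_code = $status & 0x7F ? 0x80 | ($status & 0x7F) : $status >> 8;

print "reaped pid $reaped after about ${delay}s, exit code $exit_code\n";
```

The gap between the child ending and the parent noticing is what makes "Ended" lines show up late in the trace, even when the tests themselves finished promptly.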
