in reply to double fork trick vs sig chld wait

It sounds to me like you need different signal handlers in the top-level dispatch loop and in the launch script. Something like:

    $SIG{CHLD} = 'IGNORE';

    foreach my $job (@big_list_of_jobs) {
        my $pid = fork();
        die "fork failed: $!" unless defined $pid;

        unless ($pid) {
            # child (launcher process)
            my $child_cleanup_func = sub {
                # Examine the child's output, update log files etc.
            };
            $SIG{CHLD} = $child_cleanup_func;
            exec $child_cmd, @args;
            print "Child did not start\n";    # only reached if exec fails
            exit 1;
        }
        else {
            # parent
            # Forget about the child (launcher process from the fork above);
            # move on to the next job.
        }
    }

I think it would also be a good idea to read perlipc.

Re^2: double fork trick vs sig chld wait
by Voronich (Hermit) on Nov 19, 2010 at 15:57 UTC

    A little broader context: This script dispatches jobs into a grid platform. It executes the stub binaries, which execute their grid equivalents, wait until they exit, and then return to my script (where I pick up return codes, parse and process schlock from stdout, etc.)

    I actually tried $SIG{CHLD} = 'IGNORE'; first. The net result was that the stub binaries wouldn't submit the grid jobs. I threw up my hands at that. I just don't have grid-fu and the people who were supposed to had no idea why that would occur. (But it was exhaustively demonstrated.)

    So I went back to trying to decide between double-fork and setting $SIG{CHLD} to a simple sub that just "wait"s.
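    For comparison, here is a minimal sketch of the double-fork option. The name launch_detached and the $work callback are hypothetical, not taken from the script under discussion. The first child exits immediately, so the dispatcher's waitpid returns almost at once; the grandchild is then adopted by init, so the dispatcher never has to reap it.

```perl
use strict;
use warnings;
use POSIX ();

# Hypothetical helper: run $work (a code ref) in a grandchild process.
sub launch_detached {
    my ($work) = @_;

    my $pid = fork();
    die "fork failed: $!" unless defined $pid;

    if ($pid == 0) {
        # First child: fork again, then exit straight away.
        my $grandchild = fork();
        POSIX::_exit(1) unless defined $grandchild;

        if ($grandchild == 0) {
            # Grandchild: do the real work (run the job, parse its output, ...).
            $work->();
            POSIX::_exit(0);
        }
        POSIX::_exit(0);
    }

    # Dispatcher: the first child exits at once, so this does not block for long.
    waitpid($pid, 0);
    return $? >> 8;    # exit code of the first child only
}
```

    The trade-off is that the dispatcher never sees the grandchild's exit status, so any return-code handling and stdout parsing has to happen inside $work.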

    (Yes, been back and forth through perlipc before posting here. ;) )

    http://www.mpwilson.com/uccu/

      This script dispatches jobs into a grid platform.

      Does the script wait for the job to complete, or just for the submission to complete?

      If the script isn't waiting for the job to complete, you shouldn't need to fork children. I've written a script to submit multiple jobs to Grid Engine by constructing a 'qsub' command and executing that via system, and the qsub command completed quickly enough that there wasn't a need to fork.

      If the grid system you're working with supports DRMAA you might want to look into Schedule::DRMAAc.

        The script waits for the executed command to complete (which blocks while waiting for the grid job to finish.)

        So it's semantically equivalent (at least at this level) to "run something and grab its output and return code when it finishes."
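        In Perl terms that amounts to something like the sketch below. The helper name run_and_capture is made up for illustration, and the sketch assumes $SIG{CHLD} is left at its default in the process that calls it; with 'IGNORE' in effect the status in $? would not be available.

```perl
use strict;
use warnings;

# Hypothetical helper: run a command, return its exit code and its stdout lines.
sub run_and_capture {
    my (@cmd) = @_;

    # List-form pipe open: forks, execs @cmd, and gives us its stdout.
    open(my $out, '-|', @cmd) or die "cannot run $cmd[0]: $!";
    my @lines = <$out>;

    # close() waits for the command and sets $? to its wait status.
    close($out);
    return ($? >> 8, \@lines);
}
```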

        The voodoo of the control process that this script executes, and its relationship to the grid service, is entirely a black box to the dispatcher.

        The other wrinkle is that there are occasional business-level failures of the jobs. I'm not expected to clean up any more than I can from those. But I don't want to take the thread pool route because of the possibility of polluting the dispatch loop.

        The more articles and book chapters I read, the more it seems that the way to go is $SIG{CHLD} = sub { $zombies++ }; plus a reaper function called from the while(1) loop, per Camel chapter 16.
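        A minimal sketch of that pattern, under my own assumptions: the %status hash and the name reap_children are illustrative, not from the Camel book or the script in question. The handler only counts; the actual reaping happens in the main loop with a non-blocking waitpid.

```perl
use strict;
use warnings;
use POSIX qw(:sys_wait_h);

my $zombies = 0;    # bumped by the signal handler
my %status;         # pid => exit code, filled in by the reaper

# Keep the handler trivial: just note that at least one child has exited.
$SIG{CHLD} = sub { $zombies++ };

# Collect every child that has already exited, without blocking.
sub reap_children {
    my $reaped = 0;
    $zombies = 0;    # reset first, so a signal arriving mid-loop is not lost
    while ((my $pid = waitpid(-1, WNOHANG)) > 0) {
        $status{$pid} = $? >> 8;
        $reaped++;
    }
    return $reaped;
}

# In the dispatch loop this would be called as:
#     reap_children() if $zombies;
```

        Looping on waitpid matters because several children can exit while only one SIGCHLD is delivered, so one signal does not mean one zombie.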

        The docs to wait and waitpid are the kind of things you (read: I) have to read 20 or 30 times, saying "wait, what?" each time, then go to lunch while it soaks in.

        http://www.mpwilson.com/uccu/