PerlMonks  

Monitoring child processes

by rovf (Priest)
on Mar 13, 2012 at 14:43 UTC ( [id://959364]=perlquestion )

rovf has asked for the wisdom of the Perl Monks concerning the following question:

My program (running on Solaris and Linux only, no Windows compatibility needed) is designed to start child processes and "monitor" them in the following way:
  1. If all the child processes have exited, the parent should also terminate.
  2. If the parent process detects a certain condition, it should kill all children and terminate.
My first approach for the parent (which was flawed) went like this:
  • Creating the children with the usual fork/exec mechanism and storing the child PIDs in a list.
  • In a loop, testing the aforementioned condition, and killing the children if the condition is met.
  • In the same loop, testing whether the children are still alive (by sending them the pseudo-signal 0) and terminating the loop if no child is running anymore.
It was the last item that did not work: the parent did not notice when a child exited. Inspecting the processes, I found that the exited child was marked as defunct but still present in the process table. I guess this is the reason why kill(0,...) still reported that it could send the signal. Am I right so far?
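
The behaviour described above can be reproduced with a short script. This is a minimal sketch, assuming Linux or Solaris semantics where a zombie still owns its PID; the variable names are made up for illustration, and the one-second sleep is only a crude way to let the child exit first.

```perl
use strict;
use warnings;
use POSIX ();

# Make sure SIGCHLD has its default disposition (no auto-reaping).
$SIG{CHLD} = 'DEFAULT';

# Fork a child that exits at once (POSIX::_exit skips END blocks).
my $pid = fork();
die "fork failed: $!" unless defined $pid;
POSIX::_exit(0) if $pid == 0;

# Give the child time to exit. Nobody has waited for it, so it is a zombie.
sleep 1;
my $alive_before_reap = kill(0, $pid);   # expected true: the zombie still has a PID

# Reap the child; this removes its entry from the process table.
waitpid($pid, 0);
my $alive_after_reap = kill(0, $pid);    # expected false: no such process any more

print "before reap: $alive_before_reap, after reap: $alive_after_reap\n";
```

So kill 0 answers "does this PID exist?", not "is this process still running?", which is why the polling loop never saw the children go away.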

I then thought that maybe the child could not deliver its SIGCHLD on exiting, so I added the following line to my program, prior to creating the first child process:

$SIG{CHLD}='IGNORE';
Indeed, my program now terminates immediately when all its children exit. Now I wonder: Why do I have to set this explicitly? What is the "default" signal handler for SIGCHLD?

And finally, I would like to ask you whether my approach to handling the child processes is reasonable, or whether it has other pitfalls that I just don't see yet.
-- 
Ronald Fischer <ynnor@mm.st>

Replies are listed 'Best First'.
Re: Monitoring child processes
by moritz (Cardinal) on Mar 13, 2012 at 15:12 UTC

    perlipc has quite a bit of material on handling child processes; in particular, it points out that you should call waitpid for the dying child processes so that they don't remain zombies/defunct. Calling waitpid is probably more idiomatic than polling all your children.
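
    A minimal sketch of what that could look like while keeping the parent's monitoring loop, assuming non-blocking waitpid with WNOHANG from POSIX. The helper name reap_finished and the sleeping children are hypothetical stand-ins for the real exec'd programs.

```perl
use strict;
use warnings;
use POSIX qw(WNOHANG);

# Hypothetical helper: reap every child in %$kids that has already exited,
# without blocking, and return the number of children still running.
sub reap_finished {
    my ($kids) = @_;
    for my $pid (keys %$kids) {
        my $res = waitpid($pid, WNOHANG);
        # $res == $pid: the child exited and was reaped; -1: no such child.
        # 0 means the child is still running.
        delete $kids->{$pid} if $res == $pid || $res == -1;
    }
    return scalar keys %$kids;
}

# Demo: three children that sleep for a second, standing in for exec'd programs.
my %kids;
for my $n (1 .. 3) {
    my $pid = fork();
    die "fork failed: $!" unless defined $pid;
    if ($pid == 0) { sleep 1; POSIX::_exit(0); }
    $kids{$pid} = 1;
}

while (reap_finished(\%kids) > 0) {
    # The parent's own condition check would go here.
    select(undef, undef, undef, 0.1);   # pause 100 ms between polls
}
print "all children have exited\n";
```

    Because waitpid both detects and reaps the exited child, no zombie is left behind and SIGCHLD does not need to be touched at all.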

      I had thought about waitpid with WNOHANG (I have used a similar solution in a different application), but I was interested in getting it to run by using kill 0. Actually, this seems to work so far; I just didn't understand why I had to 'IGNORE' SIGCHLD explicitly. I got the idea to use kill instead of waitpid from an example in perlipc:
      Another interesting signal to send is signal number zero. This doesn't actually affect a child process, but instead checks whether it's alive or has changed its UID.
      unless (kill 0 => $kid_pid) { warn "something wicked happened to $kid_pid"; }
      This seemed to be an attractive and easy solution - that's why I wanted to try it out.

      -- 
      Ronald Fischer <ynnor@mm.st>
Re: Monitoring child processes
by cdarke (Prior) on Mar 13, 2012 at 15:40 UTC
    In addition to moritz's comment about waitpid: SIGCHLD is handled differently from other signals on most UNIX systems, in that it does not by default kill the process it signals.
    On Linux, check man 7 signal for more information. It is possible that Solaris handles them differently, so it would be wise to look at the equivalent man pages there as well.
      SIGCHLD is handled differently from other signals on most UNIX systems, in that it does not by default kill the process it signals.
      Correct, and at least on Linux (the Solaris docs are not precise about this), the default for SIGCHLD seems to be "ignore". That's why I was so puzzled that in my Perl program I had to explicitly set it to IGNORE to get things working...

      -- 
      Ronald Fischer <ynnor@mm.st>
        There is a low level data structure called sigaction. Without diving into the Perl guts, I really don't know what Perl does to this structure. But apparently it does something different than just not writing to it at all - I presume whatever Perl does is an attempt at multi-platform operation that on your particular platform does something that you didn't expect.
Re: Monitoring child processes
by Marshall (Canon) on Mar 13, 2012 at 15:54 UTC
    When you are running a normal program from the shell, it can return an exit status to the shell when it quits, exit(12), etc. The same sort of thing can happen when a child process exits. The parent gets a SIGCHLD when the child dies and it can go look at the child's exit status if it wants to.

    The child's entry remains in the process table as a place to keep this status. A child in this state is sometimes called a "zombie", and the process of reading the status is called "reaping" the child. waitpid makes the exit status available (in Perl, via $?), but in practice the most common thing is to just throw it away (the parent doesn't care). So that is what SIGCHLD is about. If you don't care (a) that the child died or (b) what its exit status was, then you can ignore the SIGCHLD signal and the OS will throw the exit status away for you without the parent having to do that.

    This behavior is of course OS specific - the most portable thing is to explicitly say what you want to have happen (either by installing your own signal handler or by setting it to "ignore"). So if the parent doesn't care about (a) or (b), then there is nothing wrong with having the status "auto-reaped" and discarded.
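
    The "auto-reaped" case can be observed directly. This is a minimal sketch, assuming the Linux/Solaris behaviour described above; the variable names are invented for the example, and the sleep is just a crude way to let the child exit.

```perl
use strict;
use warnings;
use POSIX ();

# Explicitly ignoring SIGCHLD asks the OS to discard exit statuses,
# so exited children do not linger as zombies.
$SIG{CHLD} = 'IGNORE';

my $pid = fork();
die "fork failed: $!" unless defined $pid;
POSIX::_exit(0) if $pid == 0;

sleep 1;                                # let the child exit
my $still_there = kill(0, $pid);        # expected false: entry already removed
my $wait_result = waitpid($pid, 0);     # expected -1: nothing left to wait for

print "kill 0 says: ", ($still_there ? "present" : "gone"),
      ", waitpid returned $wait_result\n";
```

    The price is that the exit status is lost: with 'IGNORE' in effect the parent can no longer find out how a child ended.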

Re: Monitoring child processes
by repellent (Priest) on Mar 14, 2012 at 05:41 UTC
    kill(0, $pid) will return true even if the $pid process is a zombie. By setting $SIG{CHLD}='IGNORE'; before forking, you're having exited children reaped automatically instead of lingering as zombies, so your call to kill(0, ...) then works as expected.

    I believe (someone correct me please) that the default perl handler for SIGCHLD is to do nothing. It is left up to the programmer to reap the child processes manually (via waitpid, or setting 'IGNORE').

      In a loop, testing the aforementioned condition, and killing the childs if the condition is met

    Even after killing the children (e.g. kill(9, $pid), though I hope you're nicer with kill(15, $pid)), you still need to reap them. You're doing fine since you set 'IGNORE'.
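
    A minimal sketch of "ask nicely, then reap", for the case where 'IGNORE' is not set. The helper name stop_children and the grace period are assumptions for illustration, not something taken from the original program.

```perl
use strict;
use warnings;
use POSIX qw(WNOHANG);

# Hypothetical helper: send SIGTERM to each child, reap those that exit
# within $grace seconds, then SIGKILL and reap whatever is left.
sub stop_children {
    my ($grace, @pids) = @_;
    my %left = map { $_ => 1 } @pids;
    kill 'TERM', keys %left;
    my $deadline = time() + $grace;
    while (%left && time() < $deadline) {
        for my $pid (keys %left) {
            # Non-zero means reaped (the PID) or no such child (-1).
            delete $left{$pid} if waitpid($pid, WNOHANG) != 0;
        }
        select(undef, undef, undef, 0.1) if %left;   # pause 100 ms
    }
    for my $pid (keys %left) {      # stragglers that did not honour SIGTERM
        kill 'KILL', $pid;
        waitpid($pid, 0);           # blocking reap; SIGKILL cannot be ignored
    }
    return;
}

# Demo: two long-running children, stopped after at most 2 seconds.
my @pids;
for (1 .. 2) {
    my $pid = fork();
    die "fork failed: $!" unless defined $pid;
    if ($pid == 0) { sleep 60; POSIX::_exit(0); }
    push @pids, $pid;
}
stop_children(2, @pids);
my @survivors = grep { kill(0, $_) } @pids;
print "survivors: ", scalar(@survivors), "\n";
```

    Since every child is reaped inside the helper, kill 0 gives a trustworthy answer afterwards.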
      I believe (someone correct me please) that the default perl handler for SIGCHLD is to do nothing. It is left up to the programmer to reap the child processes manually (via waitpid, or setting 'IGNORE').

      Yes, I think that is right.

Re: Monitoring child processes
by sundialsvc4 (Abbot) on Mar 13, 2012 at 21:08 UTC

    The root problem, which is extremely difficult to deal with, is basically a race-condition: the parent might “determine” the status of a child, but, before it can react to the status that it has thusly determined, the status of the child has changed. Strictly speaking, you don’t even know that your list-of-children is instantaneously correct.

    Obviously, the most desirable thing to do would be to schlep the entire responsibility off to an existing known-good CPAN module, such as, say, Parallel::ForkManager. Can you find a way to do that?

    Otherwise, I suggest that you arrange things so that the only role of the parent process/thread is “to run the nursery.” All of the other responsibilities, including checking whether a particular condition has occurred, ought to be the responsibilities of children. If that special child informs you that, indeed, “ka-ka has occurred,” the parent should respond by issuing a signal to every one of its children that asks them to “please die, as soon as you possibly can,” and then wait for them all to do so. It does not poll them to see if they are alive: it does not have to. If at all possible, it also does not kill them. (How messy ... and, how unpredictable. Each child, once alive, is responsible for setting its own affairs in order upon the occasion of its own death... timely or otherwise.)

    I have consistently found (and, maybe it’s just me ...) that if you try to give the parent process many responsibilities of its own to take care of in addition to “watching the kids,” the kids get into trouble in ways that you could not possibly have anticipated and could never reproduce. (This being a case in which computers imitate real life!!)

      The root problem, which is extremely difficult to deal with, is basically a race-condition: the parent might “determine” the status of a child, but, before it can react to the status that it has thusly determined, the status of the child has changed. Strictly speaking, you don’t even know that your list-of-children is instantaneously correct.

      This is not correct. There is no "race-condition" in properly implemented code. The OS handles some things "atomically" (I don't mean automatically - that is different - "atomic" means in a single operation) that you cannot do for yourself. SIGCHLD, like other standard signals, is a level-sensitive thing (not edge-triggered): pending instances are not counted, so when multiple children exit close to one another, you may get only one SIGCHLD signal.

      When the SIGCHLD is "delivered" (the handler starts running), the OS atomically blocks that signal. This is different from setting the sigprocmask yourself in the handler. Basically, while you are messing around in your handler, this allows an additional SIGCHLD to arrive and sit in a "pending" but "undelivered" state.

      The classic SIGCHLD handler processes all of the children via the waitpid() function (and there may very well be multiple children to process). If, say, 5 children exit while you are messing around in the handler, this fact is noted by the OS and becomes yet another SIGCHLD (a single level-triggered signal) in the "pending but undelivered" state.

      When you exit the handler, this "pending" SIGCHLD is unblocked and you immediately get another SIGCHLD signal. Basically this ensures that you will not "miss one" - that is the important part that eliminates the "race condition". The OS has to do this and it does.

      I think that it is possible under certain circumstances for you to get a SIGCHLD where there is "nothing to do" because it's already been handled (while you were just in the signal handler).

      Basically, the "race condition" is handled by the OS and there is not a possibility of "missing a SIGCHLD event" as long as you process all available children while you are in the SIGCHLD handler.

      Use the waitpid() function to reap children. Let the OS do the job of deciding who is ready to "reap" or not. There is no need for the parent to maintain its own "children" list, if that is what you meant.
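
      A minimal sketch of the classic handler described above, along the lines of the reaper examples in perlipc. The %status hash and the five short-lived children are assumptions made for the demo.

```perl
use strict;
use warnings;
use POSIX qw(WNOHANG);

my %status;   # pid => exit code, filled in by the handler

# One SIGCHLD may stand for several exited children, so keep calling
# waitpid until nothing more is ready to be reaped.
$SIG{CHLD} = sub {
    local ($!, $?);   # do not clobber the main program's $! and $?
    while ((my $pid = waitpid(-1, WNOHANG)) > 0) {
        $status{$pid} = $? >> 8;
    }
};

# Demo: five children that exit almost simultaneously with codes 0..4.
my $expected = 5;
for my $code (0 .. $expected - 1) {
    my $pid = fork();
    die "fork failed: $!" unless defined $pid;
    POSIX::_exit($code) if $pid == 0;
}

# Wait until the handler has collected every child.
sleep 1 while keys(%status) < $expected;

printf "reaped %d children\n", scalar keys %status;
```

      Passing -1 to waitpid means "any child", so the handler needs no PID list; a SIGCHLD that finds nothing left to reap simply falls straight through the loop.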

        Thank you very much (and also many thanks to the others who contributed to this thread) for the elaborate responses. It really helped me a lot!

        -- 
        Ronald Fischer <ynnor@mm.st>

Node Type: perlquestion [id://959364]
Approved by moritz
Front-paged by moritz