gmantz has asked for the wisdom of the Perl Monks concerning the following question:

Hello everybody,
I am writing a simple program to spawn a number of processes on a Solaris system (such as last, uptime, etc.) using open.
I need to know when a child process has terminated, so I installed a signal handler for CHLD that points to a subroutine calling waitpid with WNOHANG.
The script works OK, except that it misses some dead child processes: I end up having spawned 10 procs and reaped only 6 or 7. Does anyone know a workaround?
What if I run the signal handler inside a thread?
Any suggestion would be greatly appreciated.

George

Re: $SIG{CHILD} misses on zombies
by Tanktalus (Canon) on Sep 05, 2005 at 14:02 UTC

    What does your signal handler look like? My guess is that you're only waiting for a single child when two or three could have all exited at about the same time.

    use POSIX ":sys_wait_h";   # provides WNOHANG
    sub child_handler { 1 while waitpid(-1, WNOHANG) > 0; }   # 0 means children exist but none has exited yet

    This way, if there are multiple children who have recently died, you'll get them all. Of course, if all you're trying to do is ignore the children's return codes and prevent zombies, just set $SIG{CHLD} to 'IGNORE', and perl will just Do What I(you) Mean.
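
For the 'IGNORE' case mentioned above, a minimal sketch follows. It assumes a platform (most Unix systems) where ignoring CHLD makes the kernel reap children automatically; the three-child loop is only a placeholder for real work. Note that with this setting the exit codes are lost, so it does not fit a program that needs them.

```perl
use strict;
use warnings;

# Assumption: on this platform, ignoring CHLD means exited children
# are reaped automatically and never become zombies.
$SIG{CHLD} = 'IGNORE';

for my $n (1 .. 3) {
    my $pid = fork();
    die "fork failed: $!" unless defined $pid;
    exit 0 if $pid == 0;          # child: placeholder for real work
}

# Blocks until all children are gone, then finds nothing left to reap.
my $reaped = waitpid(-1, 0);
print "waitpid returned $reaped\n";
```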

      Hi and thanks for the reply.
      I have written the same code as your example, with -1 as the pid parameter to waitpid, so I think I am waiting for every child to die.
      The problem is that when a child dies, the child_handler routine runs, but if another child dies while the routine is still handling the previous child, that last death is ignored (i.e. the program counter is already inside the handler routine). I don't want to ignore the children's return codes and simply prevent zombies. I want to know whether everything I spawned is still running or has exited.

        George,

        Are you doing the waitpid in a loop like I did? That should clear up all zombies that are waiting at that point.

        sub child_handler { my $i = 0; ++$i while waitpid(-1, WNOHANG) > 0; print "Reaped $i children\n"; }
        This may help you see how many children were waiting at a given time. You can also add more debugging, such as logging the output from ps, to see the states of all processes on the system. Then you can manually identify what is actually running, zombied, etc., both before and after the waitpid loop.
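
To know which children are still running and what each one returned, the reaping loop can record $? per pid. The following is a minimal sketch, not the original poster's code: it assumes plain fork rather than a piped open, and spawn, %status and @pids are made-up names. One thing to keep in mind with piped opens is that close on the pipe handle itself waits for the child, so such a child may be reaped there instead of in the handler.

```perl
use strict;
use warnings;
use POSIX ":sys_wait_h";   # provides WNOHANG

my %status;   # pid => exit code, filled in by the handler

# Reap every child that has exited so far. One CHLD signal may stand
# for several exits, hence the loop; stop on 0 (none ready) or -1 (no children).
sub child_handler {
    local ($!, $?);   # keep the main program's $! and $? intact
    while ((my $pid = waitpid(-1, WNOHANG)) > 0) {
        $status{$pid} = $? >> 8;
    }
}
$SIG{CHLD} = \&child_handler;

# Hypothetical helper: fork a child that exits with the given code.
sub spawn {
    my ($code) = @_;
    my $pid = fork();
    die "fork failed: $!" unless defined $pid;
    exit $code if $pid == 0;   # child: placeholder for real work
    return $pid;
}

my @pids = map { spawn($_) } 0 .. 9;   # 10 children, exit codes 0..9

# A pid without a recorded status is still running (or not yet reaped).
select(undef, undef, undef, 0.1) while grep { !exists $status{$_} } @pids;

printf "pid %d exited with code %d\n", $_, $status{$_} for @pids;
```

Deriving "still running" from the list of spawned pids, rather than from a hash the handler deletes from, avoids a race where a child exits before the parent has recorded its pid.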

Re: $SIG{CHILD} misses on zombies
by QM (Parson) on Sep 06, 2005 at 15:46 UTC
    I think the other replies probably have it nailed, but I've run into issues before when writing my own forking script (it was also on Solaris, though I don't know if that made a difference).

    I spent what seemed like weeks tracking down all of the places where it could go wrong, and leave zombies or just "hang" waiting on children. Finally I came up with the code in Re: Concurrent Processes, and never had any problems (at least, not on Solaris).

    -QM
    --
    Quantum Mechanics: The dreams stuff is made of