in reply to rsh <defunct> processes appear when using fork and system calls

I suspect the problem is with mixing CHLD signal handling with system.
You might try explicitly calling waitpid in the parent rather than using a signal handler.
Why does the child have to wait for the grandchild to finish? You could set SIGHUP to IGNORE, which should keep the grandchild from being killed when the child exits, and then use exec (or, for the lazy, nohup).
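
As a rough sketch of the SIGHUP idea (untested against the original poster's setup): the child ignores HUP before calling exec, so the program it becomes is not terminated by a hangup. Here 'sleep' is only a stand-in for the real command, and the one-second pause in the parent is a crude assumption to let the child reach its exec.

```perl
use strict;
use warnings;

my $pid = fork();
die "fork failed: $!" unless defined $pid;

if ($pid == 0) {
    # Child: an ignored signal stays ignored in the exec'ed program.
    $SIG{HUP} = 'IGNORE';
    exec('sleep', 2) or die "exec failed: $!";
}

sleep 1;              # crude: give the child time to set up and exec
kill 'HUP', $pid;     # would normally terminate a plain 'sleep'
waitpid($pid, 0);

my $signal = $? & 127;    # low 7 bits of $? hold the terminating signal, if any
print "child finished, terminating signal: $signal\n";
```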

Re^2: rsh <defunct> processes appear when using fork and system calls
by whatwhat (Novice) on Aug 07, 2007 at 08:36 UTC
    I'll try rewriting my code with waitpid in the parent.

    Although I have experience with Perl, I am a novice at forking processes, so I don't know whether this is the best code for what I want to do. In my case, 'best' means most stable.

    My reasoning for having the child wait for the grandchild is twofold. First, the grandchild is a non-Perl program that runs for ~1 minute. If I used 'exec' instead of 'system' (which forks and waits for its child to finish), then, as I understand it, the command would simply be executed without anything waiting for it to finish, and I would end up with multiple CPU-heavy jobs running on one node at once instead of serially. This is an extremely undesirable outcome for me. Second, I need to analyze the output of the 24 non-Perl programs when they are finished, and the only way I know how to do that is inside the child signal reaper.

    Please let me know if my assumptions above regarding 'exec' are incorrect. If I could use your suggestion and get rid of the defunct rsh processes, while still meeting the criteria above, please let me know.

    Thanks!

      exec is a little different from how you describe it, although the effect may be similar. The exec command does a rather peculiar thing: it replaces the current program with a different one in the same process. There is no return from a successful exec, since the original code is lost. Some things do survive an exec: the PID, open file handles (in Perl, only descriptors up to $^F by default, since higher ones are marked close-on-exec), DEFAULT or IGNORE signal dispositions, and environment variables (%ENV), to mention a few.
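
      To illustrate the point about the PID surviving, here is a minimal sketch (not the original poster's code). The 'echo' command is just a placeholder for the real program. Because exec only returns on failure, the die after it fires only if the program could not be started.

```perl
use strict;
use warnings;

my $pid = fork();
die "fork failed: $!" unless defined $pid;

if ($pid == 0) {
    # Child: on success this process becomes 'echo' and never returns here.
    exec('echo', "child $$ is now running echo")
        or die "exec failed: $!";
}

# Parent: the exec'ed program kept the child's PID, so waitpid on the
# PID returned by fork still reaps it.
my $reaped = waitpid($pid, 0);
my $status = $? >> 8;
print "forked $pid, reaped $reaped, exit status $status\n";
```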

      You imply you are doing some processing after the call to system, yet looking at your code you appear to be doing an exit(0). If it got to that point then the exec did not work, so a die (showing $!) might be better.

      In the parent it is perfectly normal to call waitpid directly; there is no need to place it in a CHLD signal handler. Using a PID argument of -1 with flags of 0 (no need for WNOHANG) blocks until the next child finishes and returns its PID. For example (untested):
      while (keys %fhlist) {
          my $pid = waitpid(-1, 0);
          last if $pid == -1;    # no children left to wait for
          delete $fhlist{$pid};
      }

      By the way, if you call waitpid (or its sister wait) for each of your children then you avoid zombies.
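
      Putting the pieces together, one possible shape for the parent is sketched below. The job list, the 'sleep' commands standing in for the real programs, and the %running and %finished hashes are all assumptions made for illustration; %running plays the role of %fhlist above. Each child is reaped with a blocking waitpid, so no zombies are left behind, and any analysis of a job's output can happen right after that job is reaped instead of inside a signal handler.

```perl
use strict;
use warnings;

# Hypothetical job list; 'sleep' stands in for the real ~1 minute programs.
my @jobs = ( [ 'sleep', 1 ], [ 'sleep', 2 ], [ 'sleep', 1 ] );

my %running;     # pid => command line of a child not yet reaped
my %finished;    # pid => exit status of a reaped child

for my $job (@jobs) {
    my $pid = fork();
    die "fork failed: $!" unless defined $pid;
    if ($pid == 0) {
        exec(@$job) or die "exec @$job failed: $!";
    }
    $running{$pid} = "@$job";
}

while (keys %running) {
    my $pid = waitpid(-1, 0);    # block until any child exits
    last if $pid == -1;          # no children left to wait for
    next unless exists $running{$pid};

    my $command = delete $running{$pid};
    $finished{$pid} = $? >> 8;
    print "$command (pid $pid) exited with status $finished{$pid}\n";
    # Analysis of this job's output could go here.
}
```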