Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

When this sample script is run, it is meant to run forever, spawning new child processes and reaping them on a CHLD signal. However, eventually it catches a -1 and dies. I have tried this on Solaris and Linux: Solaris dies after a couple of minutes, Linux takes a little longer. I was under the impression that 5.8 had good signal handling. What am I doing wrong?

    #! /usr/bin/perl -wT
    use POSIX();
    use strict;
    require 5.008;

    sub zap {
        $SIG{CHLD} = \&zap; # loathe sysV, dream of real POSIX
        while (my $pid = waitpid(-1, POSIX::WNOHANG())) {
            last unless $pid>0;
            if ($? != 256) {
                die("Caught a $?");
            }
        }
    }

    $SIG{CHLD} = \&zap;
    my ($pid);
    while (1) {
        FORK: foreach (1..4) {
            if ($pid = fork) {
            } elsif (defined $pid) {
                sleep 1;
                exit 1;
            } elsif ($! =~ /No more process/) {
                sleep 5;
                redo FORK;
            } else {
                die("Can't fork");
            }
        }
    }
    __DATA__
    perl -V gives
    Summary of my perl5 (revision 5.0 version 8 subversion 0) configuration:
      Platform:
        osname=solaris, osvers=2.8, archname=sun4-solaris-thread-multi
        uname='sunos hostname 5.8 generic_108528-15 sun4u sparc sunw,sun-blade-100 '
        config_args=''
        hint=recommended, useposix=true, d_sigaction=define
        usethreads=define use5005threads=undef useithreads=define usemultiplicity=define
        useperlio=define d_sfio=undef uselargefiles=define usesocks=undef
        use64bitint=undef use64bitall=undef uselongdouble=undef
        usemymalloc=n, bincompat5005=undef
      Compiler:
        cc='cc', ccflags ='-D_REENTRANT -fno-strict-aliasing -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64',
        optimize='-O',
        cppflags='-D_REENTRANT -fno-strict-aliasing'
        ccversion='', gccversion='3.2.2', gccosandvers='solaris2.8'
        intsize=4, longsize=4, ptrsize=4, doublesize=8, byteorder=4321
        d_longlong=define, longlongsize=8, d_longdbl=define, longdblsize=16
        ivtype='long', ivsize=4, nvtype='double', nvsize=8, Off_t='off_t', lseeksize=8
        alignbytes=8, prototype=define
      Linker and Libraries:
        ld='cc', ldflags =' -L/usr/local/lib '
        libpth=/usr/local/lib /usr/lib /usr/ccs/lib
        libs=-lsocket -lnsl -ldl -lm -lrt -lpthread -lc
        perllibs=-lsocket -lnsl -ldl -lm -lrt -lpthread -lc
        libc=/lib/libc.so, so=so, useshrplib=false, libperl=libperl.a
        gnulibc_version=''
      Dynamic Linking:
        dlsrc=dl_dlopen.xs, dlext=so, d_dlsymun=undef, ccdlflags=' '
        cccdlflags=' ', lddlflags='-G -L/usr/local/lib'

update (broquaint): changed <pre> to <code> + added formatting

Re: Signal Handling problems in 5.8
by tilly (Archbishop) on Jun 23, 2003 at 07:39 UTC
    I don't know the internals of signal handling well enough to give a definitive answer. However, given that you are doing a pretty nasty stress test, I would not be surprised if you were able to smoke out very subtle races.

    With that in mind, when I look in perlipc I notice a comment about SysV which says that on BSD and POSIX systems you don't need to reset the signal handler, but on SysV you not only need to reset it, you need to reset it after the wait. I don't know Unix well enough to know what Tom Christiansen is referring to there, or whether current Linux and Solaris would show the POSIX or the SysV behaviour. So I would test removing the resetting of the handler entirely, or failing that I would try moving that resetting inside of your loop.
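
    A minimal sketch of the two handler variants suggested above, for anyone who wants to try them in isolation. It is an illustration under stated assumptions, not a confirmed fix: the counter $reaped, the sub names reap_no_reset and reap_reset_after, and the fixed count of four children are all hypothetical, and the second variant reinstalls the handler once the wait loop has finished, which is one reading of "after the wait".

```perl
#!/usr/bin/perl
use strict;
use warnings;
use POSIX ();

my $reaped = 0;    # hypothetical counter, only here to make the result visible

# Variant 1: never reinstall the handler. This relies on the handler staying
# installed after it fires (the BSD/POSIX behaviour mentioned in perlipc).
sub reap_no_reset {
    $reaped++ while waitpid(-1, POSIX::WNOHANG()) > 0;
}

# Variant 2: reinstall the handler, but only after the wait loop is done
# (the order perlipc describes for SysV).
sub reap_reset_after {
    $reaped++ while waitpid(-1, POSIX::WNOHANG()) > 0;
    $SIG{CHLD} = \&reap_reset_after;
}

$SIG{CHLD} = \&reap_no_reset;    # swap in \&reap_reset_after to compare

my $children = 4;
for (1 .. $children) {
    my $pid = fork;
    die "Can't fork: $!" unless defined $pid;
    exit 1 if $pid == 0;         # child: exit at once with status 1
}

# sleep() returns early when a signal arrives, so keep checking the counter.
sleep 1 while $reaped < $children;
print "reaped $reaped children\n";
```

    Unlike the original script this one forks a fixed number of children and then stops, so it will not reproduce a race that needs minutes of stress to appear; it only shows the shape of the two handlers.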

    If neither of those helps and nobody else comes up with any useful suggestions, I would use the perlbug utility to submit a bug report to p5p. (If you do figure out the issue and see that there is a way to improve the documentation to make it clearer to other people, you can always send a patch to perlipc.pod in...)

Re: Signal Handling problems in 5.8
by perlplexer (Hermit) on Jun 23, 2003 at 21:23 UTC
    This may actually have nothing to do with Perl itself.
    Note that your parent process sleeps for 5 seconds when it detects that the system can no longer create new processes, yet the child processes only sleep for 1 second. So it is entirely possible that all of them finish before the parent wakes up. waitpid() would return -1 in such a case because the CHLD signal handler is called asynchronously: if one invocation is interrupted somewhere before the code that obtains the child's PID, the very last invocation will loop through and collect all the zombies. When control finally returns to the first invocation of the signal handler and it proceeds to get the PID, waitpid() will return -1 because there will be nothing left to collect...
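
    The "nothing left to collect" case can be seen without any signal handler at all. The sketch below is only an illustration of that one point, not a diagnosis of the original script: it collects a single child synchronously and then calls waitpid() again. The variable names $first, $status and $second are hypothetical, and it assumes the CHLD disposition is the default one.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use POSIX ();

# Assumption: default CHLD disposition, so an exited child stays around
# as a zombie until the parent waits for it.
$SIG{CHLD} = 'DEFAULT';

my $pid = fork;
die "Can't fork: $!" unless defined $pid;
exit 1 if $pid == 0;                  # child: exit with status 1

my $first  = waitpid($pid, 0);        # blocks until the child is collected
my $status = $? >> 8;                 # exit code is in the high byte of $?

# A second, non-blocking call finds no children at all.
my $second = waitpid(-1, POSIX::WNOHANG());

print "first call returned the child's pid: ", ($first == $pid ? "yes" : "no"), "\n";
print "child exit status: $status\n";
print "second call returned: $second\n";
```

    A reaper that may run when another invocation has already collected everything therefore has to treat a return value of -1 (or 0 with WNOHANG) as "nothing to do" and only look at $? when the returned PID is greater than zero.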

    --perlplexer


    Updated: fixed typos