halecommarachel has asked for the wisdom of the Perl Monks concerning the following question:

Hi Monks,

I had a simpler version of the below code working, but when I added some complexity, the signal handler for CHLD stopped reaping zombies. Here is my code now:

#!/tools/xgs/perl/5.8.8/bin/perl -w use strict; use lib '/home/fisusr/perl-packages/'; use Benchmark; use Getopt::Long; use Data::Dumper; use POSIX ":sys_wait_h"; $| = 1; my %children; my $max = 16; # monitor child processes $SIG{CHLD} = sub { local ($!, $?); my $pid = waitpid(-1, WNOHANG); return if $pid == -1; return unless defined $children{$pid}; delete $children{$pid}; unless (kill 0 => $pid) { print "Killed child $pid\n"; } }; my $dest = "/scratch/du/sb"; my @dirs; my @skip; my $help = 0; my $limit; GetOptions( 'max=i' => \$max, 'dest=s' => \$dest, 'limit=i' => \$limit, 'dir=s' => \@dirs, 'skip=s' => \@skip, 'help' => \$help ); usage() if ($help); if ($max > 16) { print "Maximum number of child processes has been set to $max; pro +ceed? "; chomp(my $ans = <>); unless ($ans =~ /y|yes/i) { print "Exiting ...\n"; exit(1); } } usage("$dest is not a directory") unless (-d $dest); usage("-dir is required") unless (@dirs); foreach my $dir (@dirs) { usage("$dir does not exist!") unless (-d $dir); } my %dirs = map { $_ => 1 } @dirs; my %skip = map { $_ => 1 } @skip; multi_dir(\%dirs,\%skip,$limit); sub multi_dir { my ($dirs,$skip,$limit) = @_; foreach my $dir (keys %$dirs) { if(opendir(my $dh,$dir)) { (my $name = $dir) =~ s/\//_/g; mkdir "$dest/$name" unless (-e "$dest/$name"); my @subs = sort grep { !/^\.+/ && -d "$dir/$_" } readdir($ +dh); closedir($dh); if ($limit) { @subs = @subs[1 .. $limit]; } my $total = @subs; print "There are $total directories/files to evaluate ...\ +n"; my $count = 0; foreach my $sub (@subs) { next if (exists $skip->{$sub}); next if ($sub eq $dir); while (keys %children >= $max) { } $count++; die "Cannot fork: $!\n" unless defined(my $child_pid = + fork()); if ($child_pid > 0) { $children{$child_pid} = 1; print "There are currently ", scalar(keys %childre +n), " child processes\n"; next; } elsif ($child_pid == 0) { system("/proj/fisdata/fisusr/sandboxes/sasi/tools/diskusagereports_v20 +1/scripts/find.sh $dir/$sub -type d -o -type f > $dest/$name/$sub.dat +"); # print "/proj/fisdata/fisusr/sandboxes/sasi/tools +/diskusagereports_v201/scripts/find.sh $dir/$sub -type d -o -type f > + $dest/$name/$sub.dat"; print "Child [$$] running find.sh on $sub $coun +t/$total\n"; } exit; } } else { warn "Could not read $dir: $!\n"; } } }

Here is a sample run:

[fisusr@somemachine /scratch/du/sb]$ /proj/rdi-xco/staff/rachelh/scrip +ts/du_reports.pl -dir /proj/fisdata/fisusr/ibs -limit 2 There are 2 directories/files to evaluate ... There are currently 1 child processes There are currently 2 child processes [fisusr@somemachine /scratch/du/sb]$ Child [14218] running find.sh on +2012.2 1/2) Child [14220] running find.sh on 2012.3 2/2

When I had the code working, I would see the "killed child" messages; now the script hangs

Replies are listed 'Best First'.
Re: CHLD handler stopped working
by kcott (Archbishop) on May 12, 2014 at 21:11 UTC

    G'day halecommarachel,

    "I had a simpler version of the below code working, but when I added some complexity, the signal handler for CHLD stopped reaping zombies. Here is my code now:"

    Actually showing us what was working and what you added to stop it working would have been very useful. As it stands, we have no way of telling what change caused the failure.

    Furthermore, cutting your code down to the absolute minimum which still reproduced your problem may have helped you to solve the problem by yourself. Even if you still couldn't solve it, we wouldn't have to wade through totally irrelevant code looking for a problem.

    Your immediate problem is the position of the exit statement. You probably intended to exit only from the child process.

    Here's a cut-down version of your script which (roughly) reproduces the problem you described:

    #!/usr/bin/env perl -l use strict; use warnings; use autodie qw{:all}; use POSIX qw{WNOHANG}; my %children; $SIG{CHLD} = sub { local ($!, $?); my $pid = waitpid(-1, WNOHANG); return if $pid == -1; return unless defined $children{$pid}; delete $children{$pid}; print "SIG{CHLD}: PID $pid ", kill(0 => $pid) ? 'NOT ' : '', 'reap +ed.'; }; print "PARENT($$): Before multi_dir() call"; multi_dir(); sub multi_dir { for (0 .. 1) { my $pid = fork; die 'Undefined PID from fork()' unless defined $pid; if ($pid) { # parent $children{$pid} = 1; print "PARENT($$): Started child process with PID: $pid"; print "PARENT($$): ", scalar keys %children, ' child proce +sses'; } else { # child print "CHILD($$): New process started"; system 'echo PPID=$PPID PID=$$; sleep 1'; print "CHILD($$): system() finished"; } exit; # <<<=== PROBLEM HERE! } }

    By moving the exit to the child code, the script no longer hangs:

    sub multi_dir { ... print "CHILD($$): system() finished"; exit; # <<<=== PROBLEM FIXED! } } }

    Having said that, I recommend you read what exit says about exiting from subroutines. Even if the simple fix I described above works for you now, when you next add a little more "complexiy" you may find yourself with problems again. Perhaps you could return with the child PID instead of using exit; then, if the end of &multi_dir is reached, return with zero — that way, you can test the return value of multi_dir() and exit if it's the child or continue with the main script if it's the parent. Here's an example:

    ... print "PARENT($$): Before multi_dir() call"; my $multi_dir_return = multi_dir(); if ($multi_dir_return) { print "CHILD($multi_dir_return): exiting"; exit; } print "PARENT($$): After multi_dir() call"; sub multi_dir { ... print "CHILD($$): system() finished"; return $$; } } return 0; }

    Your code has some other (potential) problems. Here's a couple I spotted:

    This:

    if ($limit) { @subs = @subs[1 .. $limit]; }

    would probably have been better as something closer to:

    if (defined $limit and $limit > @subs) { @subs = @subs[0 .. $limit - 1]; }

    And this:

    while (keys %children >= $max) { }

    is potentially an infinite loop. It's also repeatedly checking the condition, and doing nothing else, which is chewing up a lot CPU cycles unnecessarily. Consider adding a sleep with an alarm (see Time::HiRes for times with finer granularity than whole seconds).

    -- Ken

      Hi Ken,

      About the 'exit' statement, I have a 'next' in the parent's 'if' statement, so only the child reaches the 'exit'.

      The potential 'while' infinite loop - is there a better way to wait for the number of child processes to drop below the max?

      Thanks, Rachel

        Firstly, you've changed your OP without indicating what you've changed. Please do not do this! It's perfectly OK to edit your nodes but do ensure you clearly mark what you've changed: "How do I change/delete my post?" explains the how and why of this.

        "About the 'exit' statement, I have a 'next' in the parent's 'if' statement, so only the child reaches the 'exit'."

        The exit statement is still not clearly bound to the child code. The exit statement is still in the subroutine definition. Is there a reason you do not wish to use return.

        "The potential 'while' infinite loop - is there a better way to wait for the number of child processes to drop below the max?"

        I've already suggested using sleep and alarm. Do you have a problem with that suggestion? If so, what is it?

        -- Ken

Re: CHLD handler stopped working
by Old_Gray_Bear (Bishop) on May 12, 2014 at 21:02 UTC
    What code did you add? Remove it and determine if the CHLD handler works.

    Start adding your 'complexities' one at a time and verify that the Handler is still working after each add. When it stops working, the last thing the you added is the one with the Bug. Fix it.

    ----
    I Go Back to Sleep, Now.

    OGB