jafoba has asked for the wisdom of the Perl Monks concerning the following question:

I am writing a script that will fork a controlled number of subprocesses. Each subprocess will execute a command line via system so that I can reap the logs created by that process. It is possible for the executable to stop itself or so it seems. The problem is when the subprocess stops like this the parent also stops. I can force them to complete by sending SIGCONT to the child pid and then the parent pid. I thought that WNOHANG would keep the parent from hanging. I tried to capture the stop in both the parent and child but no luck. Any ideas how to get around this? I have read perlipc and have written other similar scripts that are working.

Included is some code that shows what I am attempting to do. It only forks one child but the real one will fork multiple. Perl 5.8.2 on AIX 5.3.

#!/usr/bin/perl
use strict;
use lib '/homegrown/lib';
use MY_Logging;
use MY_File qw(slurp);
use POSIX qw(:signal_h :errno_h :sys_wait_h);

select STDERR; $|=1;

our ($me) = $0 =~ m/\/?([^\/]+)$/;
my %children;
my $child;

$SIG{CHLD} = \&REAPER;

sub REAPER {
    my $reaped_pid = waitpid(-1, &WNOHANG);
    if ($reaped_pid == -1) {
    }
    elsif (WIFEXITED($?)) {
        delete $children{$reaped_pid};
        print STDERR slurp("$me.$reaped_pid.log");
    }
    else {
        logMsg("False alarm on $reaped_pid");
    }
    $SIG{CHLD} = \&REAPER;
}

if ($child = fork()) {
    $children{$child} = 'TEST';
    logMsg("spawned pid $child");
}
else {
    close(STDERR);
    my $LOG = "$me.$$.log";
    open (STDERR, ">>$LOG") or dieScreaming("$LOG open failure: $!");
    select STDERR; $|=1;

    foreach my $host ('eidspstd01', 'eiqspstd01') {
        my $cmd = "...";
        logMsg("executing $cmd");
        my $CMD_LOG = "$me.$$.load.log";
        system("$cmd >$CMD_LOG 2>&1");
        logLog($CMD_LOG); # pull in cmd log contents to STDERR
    }

    logMsg("completed TEST");
    close(STDERR);
    exit(0);
}

# wait for any children that may be running still
while (%children) {
    logMsg("WAITING FOR: ", join(',', values %children));
    sleep(5);
}

logMsg("$me complete");

Replies are listed 'Best First'.
Re: Stopped child hangs parent
by Argel (Prior) on Aug 06, 2009 at 23:20 UTC
    Is there a reason you cannot use alarm?

    Elda Taluta; Sarks Sark; Ark Arks

      Using alarm would mean that there is a max amount of seconds the child process would take. The purpose of this script is to load files into a database. The size of the files will always differ and so the length of time would also. I am assuming you mean to wrap the child in an alarm eval. Please correct me if I am thinking of alarm incorrectly.

        You didn't really indicate how long the system calls take. If you know they normally take, e.g., 4 minutes, then an alarm of five minutes would resolve your problem. Though that's really more of a workaround. I will bump this to the front page to increase the chances that someone more familiar with forking and signal handling will look at it.

        Elda Taluta; Sarks Sark; Ark Arks

Re: Stopped child hangs parent
by ig (Vicar) on Aug 08, 2009 at 10:24 UTC
    It is possible for the executable to stop itself or so it seems.

    I take it you are talking about the command run by system(). What do you know about the state of the process when it is stopped, what causes it to stop, and what its natural course is if you don't signal it? Can you say what the command is?

    The problem is when the subprocess stops like this the parent also stops.

    If you mean that the child process of your main perl process waits until the system call returns, this is the way system works. If you mean the parent perl process stops, what do you know about the process when it is stopped? Is it some particular function that does not complete?

    I tried to capture the stop in both the parent and child but no luck.

    Do you mean that the signal was received by your process but your signal handler was not called? Or was no stop signal received?

      The command is arsload, which is a document ingestion utility for IBM DB2 OnDemand Content Manager. I am forcing an error condition which causes this situation. When I run the same command on the command line I get a message telling me the pid was stopped. When running my script I am using top and see the parent and child processes in 'stop' state. If I do not send the SIGCONT signal nothing happens. The processes remain in the 'stop' state. If I send the signal just to the child, the parent remains stopped. I have to signal the parent as well.

      As for the parent, the while loop at the end of the script should output every 5 seconds to show it is waiting. Nothing is output even if I signal the child to continue which ends the child.

      I tried $SIG{STOP} = sub {print STDERR "$$ stopped\n";}; in the child and nothing was output. I then tried it in the parent and nothing was output.

      Maybe I am being misled by the stop state into thinking it is the stop signal? For debug purposes only, I have been toying with the idea of putting in a signal handler for almost every signal.

        When the error condition occurs I expect you will have three processes: a parent perl process, a child perl process and an arsload process. Are all three stopped?

        SIGSTOP can't be caught or ignored but SIGTSTP, SIGTTIN and SIGTTOU can. All four signals stop a process by default. The latter two are usually generated only for processes running in the background. Are you running your program in the background?
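To make the distinction above concrete, here is a minimal sketch (not the original poster's code) that installs handlers for the three catchable stop signals and then sends SIGTSTP to itself. The @caught array and the polling loop are illustrative assumptions. Because a handler is installed, the default "stop" action does not happen and the process keeps running. SIGSTOP is deliberately not sent: it cannot be caught, so a $SIG{STOP} handler is never called.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my @caught;    # names of the signals our handlers have seen (illustrative)

# SIGTSTP, SIGTTIN and SIGTTOU can be caught or ignored.
# Perl passes the signal name (without "SIG") as the first argument.
for my $name (qw(TSTP TTIN TTOU)) {
    $SIG{$name} = sub { push @caught, $_[0] };
}

# Send ourselves a catchable stop signal. The handler runs instead of
# the default action, so this process is not stopped.
kill 'TSTP', $$;

# Give Perl's deferred signal delivery a moment, with an upper bound.
for (1 .. 50) {
    last if @caught;
    select(undef, undef, undef, 0.1);
}

print "caught: @caught; process $$ is still running\n";
```

Logging from handlers like these in both the perl child and the parent would show whether one of the catchable stop signals is involved; if nothing is logged and the processes still stop, SIGSTOP is the remaining candidate.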

        You could initiate a separate process group in your child process: see "Complete Dissociation of Child from Parent" in perlipc. If you do this then any signals sent from the arsload process to all processes in the process group should not affect the parent process.
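A rough sketch of the process-group suggestion, assuming POSIX::setsid as used in perlipc (the built-in setpgrp(0, 0) is an alternative that changes only the process group and keeps the session). The pipe is there only so the parent can print the child's process group for the demonstration; in the real script the child would go on to call system() at the marked point.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use POSIX qw(setsid);

my $parent_pgrp = getpgrp();    # process group of the current process

# Pipe used only to report the child's process group back to the parent.
pipe(my $rd, my $wr) or die "pipe failed: $!";

my $pid = fork();
die "fork failed: $!" unless defined $pid;

if ($pid == 0) {
    close $rd;
    # Child: start a new session, which also makes this process the
    # leader of a new process group.
    setsid() != -1 or die "setsid failed: $!";
    print {$wr} getpgrp(), "\n";
    close $wr;
    # ... the real script would run system(...) here ...
    exit 0;
}

close $wr;
chomp(my $child_pgrp = <$rd>);
close $rd;
waitpid($pid, 0);

print "parent process group: $parent_pgrp\n";
print "child process group:  $child_pgrp\n";
```

Note that setsid also detaches the child from the controlling terminal, which changes how terminal-generated signals reach it; whether that is acceptable for arsload is something to test rather than assume.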

        Your parent process should receive a SIGCHLD when its child is stopped. If it is not stopped itself then it can send a SIGCONT.
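One way the parent could notice a stopped child and resume it is sketched below, under stated assumptions: the child stops itself with SIGSTOP as a stand-in for the stopped command, and the status is read from ${^CHILD_ERROR_NATIVE}, which needs Perl 5.10 or later (on 5.8.x, as in the question, that variable is not available). The @events array is purely illustrative.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use POSIX qw(:sys_wait_h);

my $pid = fork();
die "fork failed: $!" unless defined $pid;

if ($pid == 0) {
    # Child: stop itself, standing in for a command that gets stopped.
    kill 'STOP', $$;
    exit 0;    # reached only after the parent sends SIGCONT
}

my @events;    # what the parent observed, in order (illustrative)
while (1) {
    # WUNTRACED makes waitpid report stopped children as well as
    # terminated ones.
    my $got = waitpid($pid, WUNTRACED);
    last if $got == -1;

    my $status = ${^CHILD_ERROR_NATIVE};    # native wait status
    if (WIFSTOPPED($status)) {
        push @events, 'stopped';
        kill 'CONT', $pid;                  # resume the child
    }
    elsif (WIFEXITED($status)) {
        push @events, 'exited ' . WEXITSTATUS($status);
        last;
    }
    elsif (WIFSIGNALED($status)) {
        push @events, 'killed by signal ' . WTERMSIG($status);
        last;
    }
}

print join(', ', @events), "\n";
```

In a SIGCHLD handler the same check would use waitpid(-1, WNOHANG | WUNTRACED) so the handler never blocks. This only helps if the parent itself is still running, which is why the separate process group matters.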