kennethk has asked for the wisdom of the Perl Monks concerning the following question:

As part of some quality control, I've been writing a script that tests whether a homebrew command line utility returns physically-appropriate outputs over a very large parameter space. Essentially, I generated a work queue in the master thread and then in each slave captured the output with backticks and ran it through some numerical tests. This has been functioning for about a month.

Yesterday, I had a rude awakening when I managed to take down the server my test suite was running on. Analysis after a hard reboot of the server showed that in exceptional cases, the utility seems to start grabbing memory in an infinite loop - it consumes the 64GB of physical memory in a few minutes. So today's job was modifying the test so it would abort a given test after a suitable timeout.

Looking through the archives, I first tried following Re: killing a program called with system() if it takes too long? by implementing an alarm (in a single threaded version), which resulted in the creation of zombies that continued chewing through my memory. In order to actually kill the computations, I combined a pipe with repeated calls to `ps --ppid` and kill to get all spawned processes (inspired by almut's Re^2: Killing children's of children). I then discovered that alarms don't work in a threaded context, so I followed ikegami's advice in Re^3: Threads and ALARM() - There is a solution or workaround? and used select on the pipe instead of signals, resulting in some code that seems to be working (special thanks to duff's Using select and IO::Select tutorial). So I replaced

        my $content = `$command 2>&1`;

with

        my $pid = open my $calc, '-|', "$command 2>&1"
            or die "Pipe failed on open: $!\n";
        my $vector = '';
        vec($vector,fileno($calc),1) = 1; # Flip bit for pipe
        unless (select($vector,undef,undef,$timeout)) { # Calculation is hanging
            # collect list of spawned processes
            my @pids = $pid;
            my $i = -1;
            push @pids, `ps --ppid $pids[$i] -o pid` =~ /\d+/g while ++$i < @pids;
            kill 9, $_ for @pids;
            die "Calculation call failed to return with $timeout seconds\n";
        }
        local $/; # Slurp
        my $content = <$calc>;
        close $calc;

where both are wrapped in eval blocks. So my questions are:

  1. Is this a reasonable way to be handling this? Have I missed something that will come back to bite me?
  2. I do not feel entirely comfortable with my pid collection strategy. Is there a more robust way to handle "kill all spawned processes"?
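
On question 1, one property of select is worth spelling out: it reports only that the pipe has *some* data (or EOF) ready, so a slurp issued after it can still block if the utility prints a little and then hangs. A minimal sketch of one way around that, assuming a Unix-like system; `slurp_with_deadline` is a hypothetical helper name, and the child command here is only a stand-in for the real utility:

```perl
use strict;
use warnings;
use Time::HiRes qw(time);

# Hypothetical helper: read everything from $fh, but give up once $timeout
# seconds have elapsed in total. Returns the text, or undef on timeout.
sub slurp_with_deadline {
    my ( $fh, $timeout ) = @_;
    my $deadline = time + $timeout;
    my $content  = '';
    while (1) {
        my $remaining = $deadline - time;
        return if $remaining <= 0;                    # out of time
        my $vector = '';
        vec( $vector, fileno($fh), 1 ) = 1;
        my $ready = select( $vector, undef, undef, $remaining );
        next if $ready <= 0;                          # timeout or interrupted
        my $n = sysread( $fh, $content, 65536, length $content );
        die "read failed: $!\n" unless defined $n;
        return $content if $n == 0;                   # EOF: output complete
    }
}

# Stand-in for a misbehaving utility: prints something, then hangs.
my $pid = open my $calc, '-|', $^X, '-e', '$| = 1; print "partial\n"; sleep 30'
    or die "Pipe failed on open: $!\n";
my $content = slurp_with_deadline( $calc, 1 );
kill 'KILL', $pid unless defined $content;            # timed out: kill it
close $calc;                                          # reaps the child
```

The loop uses sysread rather than `<$calc>` so that buffered reads never wait for more data than select has reported as available.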

Re: Handling badly behaved system calls in threads
by BrowserUk (Patriarch) on Aug 26, 2010 at 22:19 UTC

    Here's a simple and reliable way of doing it:

    #! perl -slw
    use strict;
    use threads;
    use threads::shared;
    use Thread::Queue;

    our $TIMEOUT //= 10;
    our $THREADS //= 16;
    our $PROCESSES //= 100;

    my $Q = new Thread::Queue;

    my $semSTDOUT :shared;
    sub tprint{ lock $semSTDOUT; print @_; }

    sub worker {
        my $tid = threads->tid;
        while( my $time = $Q->dequeue ) {
            my $pid :shared;
            my $th = async{
                $pid = open my $pipe, '-|',
                    qq[ perl -le"sleep $time; print \$\$, ' done'" ]
                    or warn "Failed to start perl with time: $time" and next;
                local $/;
                return <$pipe>;
            };
            sleep 1;
            tprint "$tid: Started pid $pid for $time seconds";
            my $t = 0;
            sleep 1 while kill( 0, $pid ) and ++$t < $TIMEOUT;
            if( kill( 0, $pid ) ) {
                tprint "$tid: $pid still running; killing";
                kill 9, $pid;
            }
            else {
                tprint "$tid: $pid completed sucessfully";
            }
            my $result = $th->join;
            tprint "$tid: pid $pid returned: '$result'";
            ## Check result;
        }
    }

    my @workers = map async( \&worker ), 1 .. $THREADS;

    $Q->enqueue( map int( rand 2 * $TIMEOUT ), 1 .. $PROCESSES );
    $Q->enqueue( (undef) x $THREADS );

    $_->join for @workers;

    __END__
    [23:21:02.98] c:\test>857569 -THREADS=4 -PROCESSES=10 -TIMEOUT=10
    1: Started pid 4032 for 2 seconds
    4: Started pid 3848 for 15 seconds
    2: Started pid 3436 for 17 seconds
    3: Started pid 2392 for 14 seconds
    1: 4032 completed sucessfully
    1: pid 4032 returned: '4032 done
    '
    1: Started pid 3640 for 11 seconds
    4: 3848 still running; killing
    4: pid 3848 returned: ''
    2: 3436 still running; killing
    2: pid 3436 returned: ''
    3: 2392 still running; killing
    3: pid 2392 returned: ''
    4: Started pid 2872 for 7 seconds
    3: Started pid 3604 for 3 seconds
    1: 3640 still running; killing
    1: pid 3640 returned: ''
    3: 3604 completed sucessfully
    3: pid 3604 returned: '3604 done
    '
    1: Started pid 1156 for 6 seconds
    3: Started pid 4452 for 3 seconds
    4: 2872 completed sucessfully
    4: pid 2872 returned: '2872 done
    '
    3: 4452 completed sucessfully
    3: pid 4452 returned: '4452 done
    '
    1: 1156 completed sucessfully
    1: pid 1156 returned: '1156 done
    '

    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
      Thanks for giving me something to chew on. I modified your script to run some of these corner cases and everything seems to behave correctly. I'm confused however why your script correctly collects all children when my attempt killed the shell but not the badly behaved grandchild. My code was something like (it has since been deleted, single-threaded version):

      my $pid = open my $calc, '-|', "$command 2>&1"
          or die "Pipe failed on open: $!\n";
      local $SIG{ALRM} = sub {
          kill 9, $pid;
          die "Calculation call failed to return with $timeout seconds\n";
      };
      alarm $timeout;
      local $/; # Slurp
      my $content = <$calc>;
      close $calc;
      alarm 0;

      Is there an obvious behavioral difference? I'm pretty sure this is very close to what I had between Re: killing a program called with system() if it takes too long? and Re^2: Killing children's of children.

        Is there an obvious behavioral difference?

        I missed: 2>&1.

        You're starting a shell that redirects stderr to stdout and runs perl, which in turn runs the script. Hence the pid returned to you from open is the shell's, not perl's.

        The simplest solution, given the script is yours, is to do that redirection within the script. Optionally, make the redirection dependent upon a command line parameter used for testing only.

        Something like:

        open STDERR, '>&=', fileno( STDOUT ) or ... if $TESTING;
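
When the utility itself cannot be changed, the same merge of stderr into stdout can be done on the calling side without a shell, so that the pid returned is that of the program itself. A minimal sketch, assuming a Unix-like system and a single-threaded caller (fork inside a threaded program needs more care); `open_merged` is a hypothetical helper name, and the child here is a throwaway perl one-liner rather than the real utility:

```perl
use strict;
use warnings;

# Hypothetical helper: run @cmd with no shell in between, stderr merged into
# stdout. Returns the child's pid and a read handle on the combined output.
sub open_merged {
    my @cmd = @_;
    my $pid = open( my $out, '-|' );    # implicit fork; child's STDOUT is the pipe
    die "fork failed: $!\n" unless defined $pid;
    if ( $pid == 0 ) {                  # in the child
        open STDERR, '>&', \*STDOUT or die "dup failed: $!\n";
        exec { $cmd[0] } @cmd or die "exec failed: $!\n";
    }
    return ( $pid, $out );
}

# Stand-in command: writes to both streams and reports its own pid.
my ( $pid, $out ) = open_merged( $^X, '-e', 'print STDERR "err\n"; print "pid=$$\n"' );
my $content = do { local $/; <$out> };
close $out;
print "open returned $pid; child said: $content";
```

Because exec is given a list and no shell is started, the pid the child reports is the same one open returned, so `kill 9, $pid` reaches the program directly.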

Re: Handling badly behaved system calls in threads
by ikegami (Patriarch) on Aug 26, 2010 at 23:01 UTC
    To kill all children, it should kill the process group (kill -$sig => $$? setsid + kill -$sig => 0?) minus itself (sig_action?).
      I came across this idea in digging through the archives (the exact posts or Perl documentation escape me) but thought it was not worth implementing since all calculations, good and bad, should be in the same process group as perl - I used threads and not fork for my model. Was I incorrect in this assumption? At any given time, I might have 20 calculations running and bad cases are a fraction of a percent of the total.

        You're implying that if you use fork (i.e. create child processes), the workers won't be in the same process group. That's not true. A process group is a group of processes, not a group of threads. (Mind you, all your threads will be in the same process group by virtue of being in the same process.)

        if (!fork()) { sleep; exit(0); }
        if (!fork()) { sleep; exit(0); }
        kill TERM => $$; # Only kills self

        if (!fork()) { sleep; exit(0); }
        if (!fork()) { sleep; exit(0); }
        kill -TERM => $$; # Kills all three processes
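
Since killing perl's own group would take the test harness down with it, one option is to give each calculation a process group of its own and signal only that group. A minimal sketch, assuming a Unix-like system and a single-threaded caller; `spawn_in_group` is a hypothetical helper name, the `sh -c` command merely simulates a shell with a lingering grandchild, and the one-second timeout is illustrative:

```perl
use strict;
use warnings;

# Hypothetical helper: start @cmd as the leader of a new process group, so
# that it and everything it spawns can be signalled without touching perl.
sub spawn_in_group {
    my @cmd = @_;
    my $pid = open( my $out, '-|' );    # implicit fork; child's STDOUT is the pipe
    die "fork failed: $!\n" unless defined $pid;
    if ( $pid == 0 ) {                  # in the child
        setpgrp( 0, 0 );                # new group whose id equals our pid
        exec { $cmd[0] } @cmd or die "exec failed: $!\n";
    }
    return ( $pid, $out );
}

my $timeout = 1;                        # seconds; illustrative value
my ( $pid, $out ) = spawn_in_group( 'sh', '-c', 'sleep 30 & sleep 30' );

my $killed = 0;
my $vector = '';
vec( $vector, fileno($out), 1 ) = 1;
unless ( select( $vector, undef, undef, $timeout ) ) {
    kill 'KILL', -$pid;                 # negative pid: signal the whole group
    $killed = 1;
}
close $out;                             # reaps the group leader
```

This replaces the `ps --ppid` walk with a single kill. It does not reach a descendant that deliberately moves itself into another group or session.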
Re: Handling badly behaved system calls in threads
by zentara (Cardinal) on Aug 27, 2010 at 11:59 UTC
    You might be creating strawmen (relatives of zombies), as ikegami pointed out in Re^3: Stopping subprocesses.

    That is the first time I've seen the strawman behavior; it's like grandchildren that won't die as part of a kill command. Only saving the pid of the command or its associated shell, and killing that pid, will clean them up.

    As you switch from backticks to a form of IPC that gives you the pid, have your slave threads either kill the pid themselves before joining (ending), or have them stuff the pid into a thread-shared variable so the master thread can do the cleanup.

    In my experience, I always use Proc::Killfam on the pid obtained in the slave thread, since it will kill the shell and program pid.


    I'm not really a human, but I play one on earth.
    Old Perl Programmer Haiku ................... flash japh
      Thanks for the info. I'll check Proc::Killfam out, though I try to avoid non-core dependencies when I can on this box as it's an Ubuntu server.

      I think BrowserUk identified why I was creating strawmen in Re^3: Handling badly behaved system calls in threads - I'm still working through the code examples he's posted for avoiding invoking the shell. Not relying on feeding backticked system calls through regular expressions appeals to me greatly.