in reply to Handling badly behaved system calls in threads

Here's a simple and reliable way of doing it:

#! perl -slw
use strict;
use threads;
use threads::shared;
use Thread::Queue;

our $TIMEOUT   //= 10;
our $THREADS   //= 16;
our $PROCESSES //= 100;

my $Q = new Thread::Queue;

my $semSTDOUT :shared;
sub tprint{ lock $semSTDOUT; print @_; }

sub worker {
    my $tid = threads->tid;
    while( my $time = $Q->dequeue ) {
        my $pid :shared;
        my $th = async{
            $pid = open my $pipe, '-|',
                qq[ perl -le"sleep $time; print \$\$, ' done'" ]
                or warn "Failed to start perl with time: $time" and next;
            local $/;
            return <$pipe>;
        };
        sleep 1;
        tprint "$tid: Started pid $pid for $time seconds";
        my $t = 0;
        sleep 1 while kill( 0, $pid ) and ++$t < $TIMEOUT;
        if( kill( 0, $pid ) ) {
            tprint "$tid: $pid still running; killing";
            kill 9, $pid;
        }
        else {
            tprint "$tid: $pid completed sucessfully";
        }
        my $result = $th->join;
        tprint "$tid: pid $pid returned: '$result'";
        ## Check result;
    }
}

my @workers = map async( \&worker ), 1 .. $THREADS;

$Q->enqueue( map int( rand 2 * $TIMEOUT ), 1 .. $PROCESSES );
$Q->enqueue( (undef) x $THREADS );

$_->join for @workers;

__END__
[23:21:02.98] c:\test>857569 -THREADS=4 -PROCESSES=10 -TIMEOUT=10
1: Started pid 4032 for 2 seconds
4: Started pid 3848 for 15 seconds
2: Started pid 3436 for 17 seconds
3: Started pid 2392 for 14 seconds
1: 4032 completed sucessfully
1: pid 4032 returned: '4032 done
'
1: Started pid 3640 for 11 seconds
4: 3848 still running; killing
4: pid 3848 returned: ''
2: 3436 still running; killing
2: pid 3436 returned: ''
3: 2392 still running; killing
3: pid 2392 returned: ''
4: Started pid 2872 for 7 seconds
3: Started pid 3604 for 3 seconds
1: 3640 still running; killing
1: pid 3640 returned: ''
3: 3604 completed sucessfully
3: pid 3604 returned: '3604 done
'
1: Started pid 1156 for 6 seconds
3: Started pid 4452 for 3 seconds
4: 2872 completed sucessfully
4: pid 2872 returned: '2872 done
'
3: 4452 completed sucessfully
3: pid 4452 returned: '4452 done
'
1: 1156 completed sucessfully
1: pid 1156 returned: '1156 done
'

Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
RIP an inspiration; A true Folk's Guy

Re^2: Handling badly behaved system calls in threads
by kennethk (Abbot) on Aug 26, 2010 at 22:59 UTC
    Thanks for giving me something to chew on. I modified your script to run some of these corner cases and everything seems to behave correctly. I'm confused, however, about why your script correctly collects all children when my attempt killed the shell but not the badly behaved grandchild. My code was something like this (the original has since been deleted; this is the single-threaded version):

    my $pid = open my $calc, '-|', "$command 2>&1"
        or die "Pipe failed on open: $!\n";
    local $SIG{ALRM} = sub {
        kill 9, $pid;
        die "Calculation call failed to return with $timeout seconds\n";
    };
    alarm $timeout;
    local $/; # Slurp
    my $content = <$calc>;
    close $calc;
    alarm 0;

    Is there an obvious behavioral difference? I'm pretty sure this is very close to what I had between Re: killing a program called with system() if it takes too long? and Re^2: Killing children's of children.

      Is there an obvious behavioral difference?

      I missed: 2>&1.

      You're starting a shell that redirects stderr to stdout and runs perl, which in turn runs the script. Hence the pid returned to you from open is that of the shell, not perl.

      The simplest solution, given the script is yours, is to do that redirection within the script. Optionally, make the redirection dependent upon a command line parameter used for testing only.

      Something like:

      open STDERR, '>&=', fileno( STDOUT ) or ... if $TESTING;
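To illustrate why the pid matters: a minimal sketch, assuming a Unix-like system where the forking form of open is available, of merging stderr into stdout without involving a shell. The helper name open_merged is hypothetical. Because the redirection happens in the forked child just before exec, the pid that open returns is the pid of the command itself, so kill reaches the right process.

```perl
use strict;
use warnings;

# Hypothetical helper: run @cmd with stderr merged into stdout, no shell involved.
# Assumes a platform with a real fork (the forking open is not available everywhere).
sub open_merged {
    my @cmd = @_;
    my $pid = open my $pipe, '-|';      # implicit fork; the child's STDOUT is the pipe
    die "fork failed: $!" unless defined $pid;
    if( $pid == 0 ) {
        # Child: point STDERR at the pipe too, then replace ourselves with the command.
        open STDERR, '>&', \*STDOUT or exit 126;
        exec { $cmd[0] } @cmd or exit 127;
    }
    return ( $pid, $pipe );             # parent: $pid is the command's own pid
}

my( $pid, $pipe ) = open_merged( $^X, '-e', 'print STDERR "err\n"; print "pid $$\n"' );
my $out = do { local $/; <$pipe> };
close $pipe;
print "open returned $pid; child said:\n$out";
```

The pid the child reports for itself should match the one returned to the parent, which is not the case when a shell sits in between.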

        By "redirection within the script", do you mean perform the redirection in the worker thread or in the called utility?

        Worker: I already redirect STDERR and STDOUT into buffers (local scalars) in every worker so that families of computations have their results contiguously located in the final report. I tried this as well as a bidirectional pipe (Bidirectional Communication with Yourself) but it fails to catch the error output by the child w/o the explicit command line redirect. I suppose I could implement it with IPC::Open3.
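A minimal sketch of the IPC::Open3 route mentioned above, assuming a single-threaded caller on a Unix-like system (alarm and threads do not mix well). The helper name run_with_timeout is hypothetical. Passing a false value as the third argument to open3 puts the child's stderr on the same handle as its stdout, and the list form of the command avoids the shell, so the returned pid belongs to the utility itself.

```perl
use strict;
use warnings;
use IPC::Open3;

# Hypothetical helper: returns the merged stdout+stderr of @cmd,
# or undef if the command had to be killed after $timeout seconds.
sub run_with_timeout {
    my( $timeout, @cmd ) = @_;

    # A false third argument makes the child's STDERR share the STDOUT handle.
    my $pid = open3( my $in, my $out, undef, @cmd );
    close $in;

    my $content = eval {
        local $SIG{ALRM} = sub { die "timeout\n" };
        alarm $timeout;
        local $/;                       # slurp
        my $c = <$out>;
        alarm 0;
        $c;
    };
    my $err = $@;
    alarm 0;

    kill 9, $pid if $err;               # $pid is the command itself, not a shell
    close $out;
    waitpid $pid, 0;                    # reap the child so no zombie is left behind

    die $err if $err and $err ne "timeout\n";
    return $content;
}

my $ok = run_with_timeout( 5, $^X, '-e', 'print STDERR "e\n"; print "o\n"' );
print defined $ok ? "captured:\n$ok" : "timed out\n";
```

Whether this catches the error output of the real utility depends on how that binary writes its diagnostics, which the sketch cannot verify.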

        Utility: While the utility is developed by my group, I am not the programmer responsible. It's a C binary (heavy computation), and philosophically I think it uses its channels correctly. Plus, the large number of issues my testing script has uncovered in the previous month has not left me as that gentleman's favorite person.