Earindil has asked for the wisdom of the Perl Monks concerning the following question:

Having a real problem understanding forking and timeouts. I have a section of code that keeps hanging on bad nodes. I'm trying to kill off the fork if it doesn't complete in a certain amount of time. I think I'm on the right track, but so but not quite there yet. Can some kind brother lay some wisdom on me?
foreach my $node (@nodes) { chomp $node; wait_for_a_kid() if keys %pid_to_node > 4; if (my $pid = fork) { ## parent does... $pid_to_node{$pid} = $node; } else { local $SIG {ALRM} = sub { kill -15, $pid or die "kill: $!"; print "Killed PID $pid\n"}; eval { unless ($pid) { ## child does... setpgrp(0,0); exit !&NODE($node); } alarm 15; waitpid $pid => 0; }; } } ## final reap: 1 while wait_for_a_kid(); sub wait_for_a_kid { my $pid = wait; return 0 if $pid < 0; my $node = delete $pid_to_node{$pid} or warn("Why did I see $p +id ($?)\n"), next; 1; }
I ended up working around this problem by implementing a timeout on my LWP::Simple gets in the NODE subroutine which is where I was having the problem. But, I'd still love to understand this whole forking/timing out issue since I seem to encounter it frequently. -----------------------------------------------
Here's is some sample code I came up with using what has been mentioned so far. Everything seems to work here except it's not killing the process if it is > 1 sec old. Seems to be ignoring the alarm completely:
#!/bin/perl for ($i=0; $i<=10; $i++) { wait_for_a_kid() if keys %pid_to_node > 3; $pid = fork; if ($pid) { ## parent does... $pid_to_node{$pid} = $i; } else { print "$i $$\n"; local $SIG {ALRM} = sub { kill -15, $$ or die "kill: $!"; print "\tKilled PID $$\n"}; # Just SIGTERM. eval { ## child does... setpgrp(0,0); exit !&Test; alarm 1; waitpid $pid => 0; }; } } ## final reap: 1 while wait_for_a_kid(); sub wait_for_a_kid { my $pid = wait; return 0 if $pid < 0; my $node = delete $pid_to_node{$pid} or warn("Why did I see $p +id ($?)\n"), next; } sub Test { sleep 10; }
OUTPUT:

0 1668
1 1669
2 1670
3 1671
Why did I see 1668 (0)
5 1811
6 1812
7 1813
8 1814
9 1920
10 1921

Replies are listed 'Best First'.
Re: forking and timeouts
by moritz (Cardinal) on Aug 09, 2007 at 12:31 UTC
    In the else branch $pid is always zero, so in the signal handler you're trying to kill PID 0.

    I don't think that makes sense (and if it does you should write a literal 0 to make things clearer).

      Not sure I understand why it would always be 0. Wouldn't it only be 0 if it was the parent process, and an actual pid # if it's a child process?

        No it's the other way round. It will always be 0 in the child process, and the PID of the child process in the parent.

        To show it a bit clearer perhaps:

        $pid = fork(); if (!defined $pid) { print "Fork failed: $@"; } elsif ($pid) { print "I am the parent, my PID is $$ and my child's PID is $pid\n"; } else { print "I am the child, my PID is $$ and my parent's PID is " . getpp +id() . "\n"; }
Re: forking and timeouts
by ww (Archbishop) on Aug 09, 2007 at 12:47 UTC
    The test in the sub is suspect:
    return 0 if $pid < 0;
    Can a pid be a negative number? I think not (but stand to be corrected).

    Clarify, Correct: Can $pid be a negative number? I don't see anything in your code to support that (but still stand to be corrected).

      Can $pid be a negative number?

      In this case $pid is the return value of wait, which is (according to the docs) ...the pid of the deceased process, or "-1" if there are no child processes.

        ++, almut. Missed that.

        I do indeed stand corrected.