I think I've got something different. There's no evidence of a successful accept followed by a failed fork (which, as I understand it, is the essence of what was being described). I watched an 'incident' just now, and for every successful accept there was a successful fork. My hangup monitor recorded an earlier hangup like this:
2011-05-10 14:39:52|No systems connected to listening sender since 2011-05-10 14:29:31
tcp 0 1413 xxx.xxx.143.138:6023 yyy.yyy.1.172:redstorm_join ESTABLISHED 29021/perl
2011-05-10 14:39:52|Restarting daemons
root 29021 1 0 13:51 pts/2 00:00:05 /usr/bin/perl /usr/sbin/caps/lsenders.pl -d
root 29029 29021 0 13:51 pts/2 00:00:00 /usr/bin/perl /usr/sbin/caps/lsenders.pl -d
The netstat line shows that a connection from a particular node was attached to the 'main' listener process (PID 29021), not to a forked 'servicer'. The situation is rather confused by the fact that there is a permanently forked-off 'monitor' process (29029) that (I hope) isn't involved in any way in this problem. When this happens, the node stays visibly 'stuck' to the listener indefinitely.
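For anyone wanting to reproduce the check, here is a minimal sketch of the test the netstat line supports: flag any ESTABLISHED connection whose owning PID is the listener itself rather than a forked child. It is written in Python, not the Perl the daemons use, and it is not the actual hangup monitor. The function name `stuck_on_listener` is hypothetical, and the column layout is assumed to match the netstat output quoted above (protocol, Recv-Q, Send-Q, local address, foreign address, state, PID/program).

```python
def stuck_on_listener(netstat_lines, listener_pid):
    """Return (local, foreign, send_q) for ESTABLISHED TCP connections
    that are still owned by the listener PID instead of a forked child."""
    stuck = []
    for line in netstat_lines:
        fields = line.split()
        # Assumed columns: proto recv-q send-q local foreign state pid/program
        if len(fields) < 7 or fields[0] != "tcp":
            continue
        state, owner = fields[5], fields[6]
        pid = owner.split("/", 1)[0]  # "29021/perl" -> "29021"
        if state == "ESTABLISHED" and pid == str(listener_pid):
            stuck.append((fields[3], fields[4], int(fields[2])))
    return stuck


# The line from the hangup log above, with the wrapped halves rejoined.
sample = [
    "tcp 0 1413 xxx.xxx.143.138:6023 yyy.yyy.1.172:redstorm_join "
    "ESTABLISHED 29021/perl",
]
print(stuck_on_listener(sample, 29021))
```

Run against the logged line with the listener PID 29021, this reports the connection as stuck; run with the monitor's PID 29029, it reports nothing, which matches the reading that the monitor does not hold the socket.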