edan has asked for the wisdom of the Perl Monks concerning the following question:

Monks,

I have a daemon that runs on a machine, with several sockets open to other nodes. We have a problem that when the system goes down (via reboot, shutdown, or the like), the socket gets hung up on the other side - I have been told this is a known problem with tcp sockets...

Anyway, the program basically runs in an infinite loop, looping on an IO::Poll call waiting for events on the sockets. I have installed sig handlers for INT, KILL, and TERM. The sig_handler basically sets a global variable that indicates the signal that was caught, and on every loop I check the global, and exit the loop if I got a signal. This seems to work when I send a signal via kill -TERM.

What doesn't work is when I reboot or shut down the machine - my program doesn't seem to get the signal. The man pages for shutdown and reboot seem to indicate that all processes are sent the TERM signal to allow them to shut down, etc., but I don't seem to be able to trap it.

Oh yeah, I'm running Linux (kernel rev 2.4.19)

(Apologies that it's a bit more OS'y than Perl'y)

The following code illustrates what I'm doing:

#!/usr/bin/perl
my $sig;
print "my pid: $$\n";
$SIG{TERM} = \&handler;
while (1) {
    if ($sig) {
        last;
    }
    select(undef, undef, undef, .2);    # sleep for .2 sec
}
# clean up here
`echo $sig > /tmp/sig.out`;    # to check after reboot if it caught it
exit;

sub handler {
    $sig = shift;
}

Any thoughts?

--
3dan

Re: (slightly OT) catching SIGTERM when system goes down
by robartes (Priest) on Apr 09, 2003 at 14:12 UTC
    At least on the RedHat machine I have here, processes are actually sent SIGTERM twice: once from shutdown, and once from init. Init sends this signal to any process not belonging in the new runlevel it's switching to (in the case of a halt or reboot, 0 or 6 respectively), and your daemon should be included in this (if it's still in the original process group given to its ancestor by init, or if it has become a child of init).

    Here's the clincher however: 5 seconds after this SIGTERM, init follows this up with a SIGKILL, which is not trappable. In your script, as Tomte has suggested, the actual action to be taken upon receipt of the signal is deferred until later (the handler just sets a flag variable). It might be the case that your script simply does not get to reading the flag variable before those 5 seconds are up. Try adding something in sub handler that logs receipt of the signal, so you can verify whether you actually receive it and are just too late to act upon it.
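
A minimal sketch of the logging suggestion above, assuming Perl 5.8 or later, where Perl-level handlers run between opcodes so a short write from inside the handler is reasonable. The log path and the self-sent TERM are assumptions made only so the example runs on its own; neither comes from the original post.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use IO::Handle;

# Hypothetical log path, chosen only for this sketch.
my $log_path = "/tmp/sig_received.$$.log";

open(my $log, '>>', $log_path) or die "cannot open $log_path: $!";
$log->autoflush(1);    # write through at once; the process may be killed soon after

my $caught;            # name of the signal seen by the handler, if any

sub handler {
    my ($name) = @_;   # Perl passes the signal name, e.g. "TERM"
    print {$log} time(), " caught SIG$name\n";   # proof of receipt
    $caught = $name;   # the main loop still does the real clean-up
}

$SIG{TERM} = \&handler;

# Stand-in for shutdown/init: signal ourselves.
kill 'TERM', $$;

# Wait (at most about 2 seconds) until the handler has run.
my $tries = 0;
select(undef, undef, undef, 0.05) until defined $caught or ++$tries > 40;

close $log;
print "caught: ", (defined $caught ? $caught : 'nothing'), "\n";
```

If the log line shows up after a reboot but the clean-up output does not, the signal arrived and the process simply ran out of time before SIGKILL.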

    CU
    Robartes-

Re: (slightly OT) catching SIGTERM when system goes down
by Tomte (Priest) on Apr 09, 2003 at 13:56 UTC

    try to echo inside the handler sub to see if you really don't get the signal...

    It isn't a good idea in general to use a fatal-signal handler just to set a "clean-later" mark. My general advice would be to do the clean-up for fatal signals in the handler itself and exit there...
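
One way to read this advice, as a sketch rather than a recommendation: the handler does the clean-up itself and then exits. The marker file is a hypothetical stand-in for real clean-up work, and the fork is there only so the example can signal a "daemon" and observe the result.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical marker file standing in for real clean-up work.
my $marker = "/tmp/cleanup_done.$$";

# Installed before fork so the child has the handler from the start.
$SIG{TERM} = sub {
    # Clean up right here, then exit from inside the handler.
    if (open my $fh, '>', $marker) {
        print {$fh} "cleaned up\n";
        close $fh;
    }
    exit 0;
};

my $pid = fork();
die "fork failed: $!" unless defined $pid;

if ($pid == 0) {
    # Child: stand-in for the daemon's main loop.
    select(undef, undef, undef, 0.2) while 1;
}

# Parent: plays the role of shutdown.
$SIG{TERM} = 'DEFAULT';
kill 'TERM', $pid;
waitpid($pid, 0);

my $exit_code = $? >> 8;     # exit status of the child
my $killed_by = $? & 127;    # non-zero if a signal killed it uncaught
print "child exit code: $exit_code, killed by signal: $killed_by\n";
```

The trade-off raised elsewhere in the thread still applies: anything non-trivial done inside a handler needs to be safe to run at the point where the signal arrives.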

    regards,
    tomte


    Hlade's Law:

    If you have a difficult task, give it to a lazy person --
    they will find an easier way to do it.

      echo will not be seen during shutdown (where would it echo to?). Like I said, I tried it by sending the signal myself using kill, and that works. It's only during reboot that it doesn't work.

      I'm actually looking into the possibility that I realized only after I posted: I am running the program in a console and then typing reboot from another console. So won't my program get killed when the console is killed, before being killed by shutdown? If so, what signal does it get? - Gotta look into that, but now I just got pulled away on something else.

      Re: 'clean-later' marks: I thought it is best to do as little as possible inside a signal-handler because of problems with re-entrant system calls - that's why I set the global and do the clean-up work later. Am I wrong in this?

      --
      3dan

        echo will not be seen during shutdown

        I meant your echo system-call with redirection to the file.

        I'm actually looking into the possibility that I realized only after I posted: I am running the program in a console and then

        But still as a daemon? The console shell being shut down won't kill a daemonized process... (daemon, is that the correct spelling?)

        Re: 'clean-later' marks: I thought it is best to do as little as possible inside a signal-handler because of problems with re-entrant system calls - that's why I set the global and do the clean-up work later. Am I wrong in this?

        I guess you may be right under 'normal' circumstances, but a fatal signal like kill is, well, fatal; you either react or you don't. Just setting a mark and carrying on won't work in the general case, because your program will likely be killed/terminated (as signaled) before it can check this mark (compare robartes' post).

        regards,
        tomte




        Edit: fixed typo

        So won't my program get killed when the console is killed, before being killed by shutdown? If so, what signal does it get?

        HUP
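
A small sketch of trapping HUP alongside TERM and INT with the same flag-setting handler as in the original script. SIGKILL is left out because it cannot be trapped. The self-sent HUP is an assumption standing in for the console going away.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $sig;    # name of the last signal caught

sub handler { $sig = shift }

# One handler for every signal that should lead to a clean exit.
$SIG{$_} = \&handler for qw(INT HUP TERM);

# Stand-in for the controlling console being closed.
kill 'HUP', $$;

# Wait (at most about 2 seconds) until the handler has run.
my $tries = 0;
select(undef, undef, undef, 0.05) until defined $sig or ++$tries > 40;

print "caught: ", (defined $sig ? $sig : 'nothing'), "\n";
```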

        --isotope
        http://www.skylab.org/~isotope/
Re: (slightly OT) catching SIGTERM when system goes down
by pg (Canon) on Apr 09, 2003 at 15:16 UTC
    Tearing down TCP connections gracefully based on fatal signals does not seem to be a good idea.

    I would suggest putting this in the hands of the peer. Between two nodes that have a TCP connection and want to maintain it all the time, it is better to send heartbeats back and forth, or at least one way (depending on your application). When the heartbeats stop (none received for a while, or after a couple of retries), the connection should be deemed gone, and both sides should then try to reestablish it.
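
The bookkeeping behind such a heartbeat check can be sketched as follows. The peer names, the 3-second timeout, and the helper names `note_heartbeat` and `dead_peers` are all hypothetical; times are passed in explicitly so the logic is easy to follow.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $TIMEOUT = 3;    # hypothetical: seconds of silence before a peer is deemed gone
my %last_heard;     # peer name => time of its last heartbeat

# Record that a heartbeat from $peer arrived at time $now.
sub note_heartbeat {
    my ($peer, $now) = @_;
    $last_heard{$peer} = $now;
}

# Return the peers that have been silent for longer than $TIMEOUT.
sub dead_peers {
    my ($now) = @_;
    return sort grep { $now - $last_heard{$_} > $TIMEOUT } keys %last_heard;
}

note_heartbeat('node-a', 100);
note_heartbeat('node-b', 100);
note_heartbeat('node-a', 103);    # node-a keeps talking, node-b goes quiet

my @dead = dead_peers(104.5);     # node-b has been silent for 4.5 s
print "dead: @dead\n";            # prints "dead: node-b"
```

In a real daemon the connection to each dead peer would be closed at this point and a reconnect attempted.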

    Generally speaking, heartbeats should be handled on their own thread, so that they are not blocked by anything else.

    The other thing is that you should look into your code to see whether there are any blocking calls in your socket program. If you can, try to avoid them, so you always have the chance to check the connection.
Re: (slightly OT) catching SIGTERM when system goes down
by edan (Curate) on Apr 10, 2003 at 12:23 UTC

    OK, first, thanks to those that helped out!!

    I wasn't trapping the SIGHUP (thanks isotope!) that was killing the program because I wasn't running as a daemon - fixed that. I wasn't testing as a daemon because I was really testing other functionality in the program - all this tcp/signal/reboot stuff just came up as a side note, and I didn't think of the fg/bg issue and what signal I would actually get...

    pg has a great point, that the peer should worry about closing the connection if he doesn't hear from me - the code is already working in a keep-alive (heartbeat?), non-blocking framework, so it wasn't too hard to add the close-connection on the other side. That was the real fix.

    I am still going to do the clean-up outside the signal handler - I originally did some stuff like dumping my state and logging to a file inside the signal handler, and I saw intermittent SIGSEGVs when I was killing the program, which rang a bell about re-entrant syscall problems, as I mentioned. If the program doesn't have enough time to clean up, so be it... It's all non-blocking anyway, and the poll timeout should be in the neighborhood of 0.2 secs, so I'm not too worried...
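
A rough sketch of the loop shape described here: the handler only records the signal, and the loop notices it within one poll timeout. The pipe stands in for the daemon's sockets, and the helper child that sends TERM after 0.3 seconds stands in for shutdown; both are assumptions for illustration only.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use IO::Poll qw(POLLIN);

my $sig;                              # set by the handler, read by the loop
$SIG{TERM} = sub { $sig = shift };    # handler only records the signal name

# A pipe stands in for the daemon's sockets; nothing is ever written to it.
pipe(my $reader, my $writer) or die "pipe failed: $!";
my $poll = IO::Poll->new;
$poll->mask($reader => POLLIN);

# Helper child: plays the role of shutdown by sending TERM after 0.3 s.
my $parent = $$;
my $pid    = fork();
die "fork failed: $!" unless defined $pid;
if ($pid == 0) {
    select(undef, undef, undef, 0.3);
    kill 'TERM', $parent;
    exit 0;
}

my $loops = 0;
while (!$sig && $loops < 50) {       # cap so the sketch always terminates
    $loops++;
    my $ready = $poll->poll(0.2);    # wait at most 0.2 s for socket events
    # ... handle any ready sockets here ...
}

waitpid($pid, 0);

# Clean-up happens here, outside the handler.
print "left loop after $loops pass(es), signal: ",
      (defined $sig ? $sig : 'none'), "\n";
```

With a 0.2 second timeout the flag is checked several times within the 5 seconds mentioned earlier in the thread, as long as no single pass through the loop blocks for long.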

    Thanks again trusty monks...

    --
    3dan