Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

I have a Perl script that needs to run 24 hours a day. Occasionally it fails and stops (something I'm looking into at the moment, probably involving broken pipes). What is the best way to monitor the process to ensure that it is running and, if it is not, restart it?

Replies are listed 'Best First'.
Re: Process Reliablity
by barndoor (Pilgrim) on Jul 20, 2000 at 15:03 UTC
    One way I've started using (thanks Dave B for the idea) is to get your process to log its pid when it starts (into a file or database table) and then remove that entry when it closes.

    A little script starts every 10 minutes (either via cron or via a sleep command) and looks in the file/database for all the running processes. It then looks for those pids on the system using 'kill 0' tests. If it finds a pid in the list which isn't running it knows the process crashed.

    For your purposes you may want this check script to run from cron so that it will reliably start every 'n' minutes.

    The idea can be extended to support many jobs which you want to watch for failure. Hope this idea helps.
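    A minimal sketch of the checking side, assuming a plain text file with one pid per line (the file format and the sub names pid_is_running and find_dead_pids are made up for illustration). It relies on the 'kill 0' test mentioned above, which sends no signal and only reports whether the process could be signalled:

```perl
use strict;
use warnings;

# Return true if a process with the given pid exists.
# Signal 0 sends nothing; it only tests whether we could signal the process.
sub pid_is_running {
    my ($pid) = @_;
    return 0 unless defined $pid && $pid =~ /^\d+$/ && $pid > 0;
    return 1 if kill 0, $pid;
    # EPERM means the process exists but belongs to another user.
    return $!{EPERM} ? 1 : 0;
}

# Read one pid per line from the pid file (assumed format) and
# return the pids that are no longer running.
sub find_dead_pids {
    my ($pidfile) = @_;
    open my $fh, '<', $pidfile or return;
    my @dead;
    while (my $line = <$fh>) {
        chomp $line;
        next unless $line =~ /^\d+$/;
        push @dead, $line unless pid_is_running($line);
    }
    close $fh;
    return @dead;
}
```

    The cron job would call find_dead_pids on the file and restart or report whatever comes back. Note that a pid can be reused by an unrelated process, so this test alone can give a false "still running".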
Re: Process Reliablity
by atl (Pilgrim) on Jul 20, 2000 at 15:58 UTC
    Nice, elegant suggestions, and I would recommend using the pid file approach. There is a trick to make this file disappear even if the process crashes (some inode trickery, but I cannot give an example right now).

    If you need a quick hack, I can offer you two low level solutions (sort of). Assuming you are using some sort of unix, you might want to try this:

    1. Use a wrapper script to start your program. It can restart it whenever the program crashes:

    #!/bin/sh
    while true
    do
        start_your_program
        sleep xxx
    done
    Advantage: the auto restart relieves you of checking all the time.
    Beware, though, that this might mean trouble if your program does ugly things when restarted after a crash, and it might put considerable load on your machine in a continuous start-crash-restart cycle if you omit a proper sleep time.

    2. You might grep through the process list to see if your program is running. This works if it has a sufficiently long and distinct name. Try this in a shell:

    ps aux | grep your_program_name | grep -v grep | wc -l
    or
    ps -ef | grep your_program_name | grep -v grep | wc -l
    depending on which unix flavour you use.
    This returns the number of instances of your program running (i.e. usually 0 or 1). You can use that from a Perl script, too, by putting the expression into backticks (``).
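    A rough sketch of doing the same count from Perl. Here the filtering is done in Perl rather than with grep, so there is no grep process to exclude; the sub name count_instances is made up for illustration, and the ps flags still depend on your unix flavour:

```perl
use strict;
use warnings;

# Count the lines of ps output that mention the program name.
# The caller passes the captured lines in, which keeps this easy to test.
sub count_instances {
    my ($name, @ps_lines) = @_;
    my $count = 0;
    for my $line (@ps_lines) {
        $count++ if index($line, $name) >= 0;
    }
    return $count;
}

# Typical use (not run here):
#   my @lines   = `ps -ef`;     # or `ps aux`
#   my $running = count_instances('your_program_name', @lines);
```

    As with the shell version, this only works if the name is distinct enough not to match unrelated processes.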

    Hope that's a bit useful. These are kind of old techniques, but they still work (most of the time) ;-)

    Andreas

Re: Process Reliablity
by dempa (Friar) on Jul 20, 2000 at 15:04 UTC
    One alternative could be to use cron. That is, if you're on a Unix system. NT has some sort of cron facility too, but since I'm not familiar with it, I won't speculate on its use.

    Anyway, you could modify your script so that it saves its own PID in a file. (You can get the PID from the special variable $$.) Then have your new script (executed from cron every 5 minutes or so) look in that file and check whether that PID is active. If the PID is running, be sure to check the args too, so you know it's really your script that has that PID.

    This would be really easy in Unix. I guess it could be done in NT too? I'll let someone else answer that...
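    A small sketch of the part that goes into the long-running script itself, assuming a one-line pid file. The sub names write_pidfile and read_pidfile are illustrative, not from any module. The END block removes the file on a normal exit; it does not run if the process is killed by an uncaught signal, which is exactly the case the cron check is there to catch:

```perl
use strict;
use warnings;

my ($pidfile, $owner_pid);

# Record our own pid ($$) so a checker script can find it.
sub write_pidfile {
    my ($path) = @_;
    open my $fh, '>', $path or die "cannot write $path: $!";
    print {$fh} "$$\n";
    close $fh or die "cannot close $path: $!";
    ($pidfile, $owner_pid) = ($path, $$);
    return $$;
}

# Read the recorded pid back; undef if the file is missing or malformed.
sub read_pidfile {
    my ($path) = @_;
    open my $fh, '<', $path or return undef;
    my $line = <$fh>;
    close $fh;
    return undef unless defined $line && $line =~ /^(\d+)$/;
    return $1;
}

# Remove the file on normal exit, but only in the process that wrote it
# (forked children run END blocks too).
END {
    unlink $pidfile if defined $owner_pid && $owner_pid == $$;
}
```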

Re: Process reliability
by young perlhopper (Scribe) on Jul 20, 2000 at 16:50 UTC
    The suggestion about using a wrapper script to simply restart the program when it crashes is a good one, but I'd suggest you do two more things. First, check the return value of the program, so that if it exits normally (e.g. on a change of run level), the wrapper will allow it to do so.

    Secondly, be sure to log all the starts and restarts, or at least notify somebody about them, so that you always have a good idea of what is going on. Otherwise, you are likely to forget about it after it "just works" (I do this all the time too, it's an easy habit to get into) and never be aware of what is going on.

    Good luck,
    Mark Logan

      Good point! Both of them. That would make a not-so-quick but better hack. Append a timestamp and a start note / end note with exit code to a log file, to have a history. You might also send an email in case of a crash. Let's see ...

      #!/bin/sh
      while true
      do
          start_your_program
          RC=$?    # return code
          if [ $RC -gt 0 ]
          then
              date >> /var/log/your_log_file
              echo "ABNORMAL program termination, rc = $RC" >> /var/log/your_log_file
              echo "terminated at `date`, restarting" | mail -s "Problem with zzz" root
              sleep xxx
          else
              date >> /var/log/your_log_file
              echo "Normal program termination, rc = $RC" >> /var/log/your_log_file
              sleep yyy
          fi
      done
      You can do further checking on the exit code as Mark suggested and act differently according to the exit code (see the case statement in your shell manual).

      Andreas

RE: Process Reliablity
by DrManhattan (Chaplain) on Jul 20, 2000 at 17:47 UTC
    Try wrapping your script in something like this:
    #!/usr/bin/perl
    use strict;

    # Loop indefinitely
    while (1) {
        # Fork off a child process
        my $kidpid = fork();
        if ($kidpid) {
            # This is the parent process. Wait for
            # the child to exit
            waitpid($kidpid, 0);
            # Put some code here to send you an alert
            # when the child dies. You can also
            # check the child's exit condition here
            # with $?.
        } elsif (defined($kidpid)) {
            # This is the child process.
            # Put your original script in here
            # or just use exec()

            # The child must not fall through into the
            # loop above, so make sure it exits.
            exit(0);
        } else {
            die "could not fork: $!";
        }
    }

    That will fork off a child process to handle your script and restart it every time it dies.
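    For the "check the child's exit condition with $?" part, here is a small sketch of how the wait status can be taken apart. The sub name describe_status is made up; the bit layout (exit code in the high byte, signal number in the low 7 bits, core-dump flag in bit 7) is the one documented for $? in perlvar:

```perl
use strict;
use warnings;

# Turn a wait status (the value of $? after waitpid) into a short description.
sub describe_status {
    my ($status) = @_;
    my $signal = $status & 127;   # low 7 bits: signal that killed the child
    my $core   = $status & 128;   # bit 7: a core file was dumped
    my $exit   = $status >> 8;    # high byte: exit code
    if ($signal) {
        return "killed by signal $signal" . ($core ? " (core dumped)" : "");
    }
    return "exited with code $exit";
}
```

    In the parent branch you could call describe_status($?) right after waitpid and put the result into the alert.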

    -Matt

Re: Process Reliablity
by c-era (Curate) on Jul 20, 2000 at 15:05 UTC
    You could set up a cron job that checks the program on a regular basis. You can also make a daemon that starts the program and monitors it to make sure it is always running. If the program is a server, you can also try using inetd to start your program when there is a request.

    If you are on WinNT I would suggest that you make your program a service.

Re: Process Reliablity
by lhoward (Vicar) on Jul 20, 2000 at 17:51 UTC
    How about catching the SIGCHLD signals and using that as your trigger to start the process again? Something along these lines:
    #!/usr/bin/perl -w
    use strict;

    launch_child();
    $SIG{CHLD} = \&launch_child;
    sleep 60 while (1);

    sub launch_child {
        print "$$ parent spawning a child\n";
        my $pid = fork;
        if (!$pid) {
            print "$$ inside the child\n";
            # all the code that does the real work is in here
            # all the other stuff is just a wrapper to keep
            # this bit going
            sleep 60 while (1);
        }
    }
    If you really care about your process you'll back this up with some of the methods mentioned above. You may also want to put some checks in to keep the process from thrashing (constantly restarting the process, which immediately dies again, etc.) just in case. This technique does have the nice side effect that the restarts are nearly instantaneous.
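    One possible shape for such an anti-thrashing check, as a sketch only: allow at most a given number of restarts inside a sliding time window. The sub name restart_allowed and its parameters are invented for this example:

```perl
use strict;
use warnings;

# Decide whether another restart is allowed: permit at most $max restarts
# within any window of $window seconds. @$history holds earlier restart times.
sub restart_allowed {
    my ($history, $now, $max, $window) = @_;
    # Forget restarts that have fallen out of the window.
    @$history = grep { $now - $_ < $window } @$history;
    return 0 if @$history >= $max;
    push @$history, $now;
    return 1;
}

# Typical use (not run here):
#   my @restarts;
#   if (restart_allowed(\@restarts, time, 5, 60)) { ... start the child ... }
#   else { ... give up or alert somebody ... }
```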

      That's an interesting solution, but it's likely to dump core. From perlipc:

      Do as little as you possibly can in your [signal] handler ... because on most systems, libraries are not re-entrant; particularly, memory allocation and I/O routines are not. That means that doing nearly anything in your handler could in theory trigger a memory fault and subsequent core dump.

      -Matt

        Under normal situations that would be the case, but since my main program doesn't do anything other than sleep there are no non-reentrant pieces of code that could be interrupted by the dying process (causing a core dump). I have used this technique before and it has proven to be quite stable.
Re: Process Reliablity
by mikfire (Deacon) on Jul 20, 2000 at 17:05 UTC
    Okay, I cannot resist. If you are using a system with SysV-style init, just put it in inittab with the respawn keyword. Why do the work when the system will do it for you?

    mikfire

      Because you are in trouble if your script keeps crashing all the time. This would generate a severe load on your machine. At least, that would keep me from hooking it into the system's runlevels until I am _very_ sure this program behaves well even in insane situations.

      Just my two cents ;-) ...

      Andreas

        That problem already exists. If the script is dying a lot, it doesn't matter what respawn mechanism is used. There will still be a significant load placed on the system.

        The problem description, though, makes it sound like it only happens occasionally, which has made it hard to debug. In that case, I would rather let a proven and well-known mechanism like init do the monitoring for me than have to debug both the respawn code and the code that is dying in the first place.

        Second, init is usually pretty smart and stops trying a job if it is respawning too quickly. So your load increases for a bit, but init does the Right Thing and stops it from becoming a fork bomb.

        mikfire

        Most versions of init that I've run into will notice that an entry has restarted over and over in a short time and just complain to the console that entry "xyz" is restarting too much and it won't restart it anymore until you change the inittab.

        You see this a lot with flaky terminals (people still know what a terminal is, don't they?) where getty keeps croaking and init just gives up on it.

        I think init is a perfect solution for this problem if you have access to it.

Re: Process Reliablity
by gaggio (Friar) on Jul 20, 2000 at 17:28 UTC
    What kind of program are you running for 24 hours?

    If it is some sort of server, you could simply implement a command "AREYOUHERE" and know that your program would have to answer "YESIAM" for it to be currently running.

    That way, you could even check if your program is running from a remote machine...
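    A sketch of what both halves might look like over a plain TCP socket. Only the AREYOUHERE / YESIAM pair comes from the idea above; the sub names, the "UNKNOWN" reply for anything else, and the line-based framing are assumptions made for the example:

```perl
use strict;
use warnings;
use IO::Socket::INET;

# Server side: map one request line to a reply.
sub answer_for {
    my ($request) = @_;
    $request =~ s/\s+\z//;    # drop the trailing newline
    return $request eq 'AREYOUHERE' ? 'YESIAM' : 'UNKNOWN';
}

# Client side: true if the server answers YESIAM within $timeout seconds.
sub is_alive {
    my ($host, $port, $timeout) = @_;
    my $sock = IO::Socket::INET->new(
        PeerAddr => $host,
        PeerPort => $port,
        Proto    => 'tcp',
        Timeout  => $timeout,
    ) or return 0;
    print {$sock} "AREYOUHERE\n";
    my $reply;
    eval {
        # Guard the read with an alarm so a hung server counts as dead.
        local $SIG{ALRM} = sub { die "timeout\n" };
        alarm $timeout;
        $reply = <$sock>;
        alarm 0;
    };
    alarm 0;
    close $sock;
    return (defined $reply && $reply =~ /^YESIAM\b/) ? 1 : 0;
}
```

    A nice property of this check is that it also catches a process that is still alive but hung, which a pid test cannot see.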
Re: Process Reliablity
by arturo (Vicar) on Jul 20, 2000 at 18:41 UTC
    Well, if the problem is a broken pipe, why not install a signal handler to trap that error (in fact, why not install signal handlers for all the errors you might run into?). That way, if the program traps an error, it can execute any "cleanup code" that needs to be done before wiping the pid file and shutting down. I believe the syntax for a signal handler involves manipulating the "pseudo" hash %SIG and looks something like this.

    $SIG{PIPE} = sub {
        # handle error
    };
    There's also a sigtrap module; you might want to check that out.
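    A cautious sketch of how this could be wired up, assuming the program has a main loop that can poll a flag. The handler only sets the flag, in line with the perlipc advice to do as little as possible inside a handler; the names $got_sigpipe and handle_pending_sigpipe, and the cleanup callback, are invented for the example:

```perl
use strict;
use warnings;

my $got_sigpipe = 0;

# Keep the handler tiny: just record that the signal arrived.
$SIG{PIPE} = sub { $got_sigpipe = 1 };

# Call this from the main loop. If a SIGPIPE was seen, run the
# caller-supplied cleanup code (e.g. remove the pid file) and report it.
sub handle_pending_sigpipe {
    my ($cleanup) = @_;
    return 0 unless $got_sigpipe;
    $got_sigpipe = 0;
    $cleanup->() if $cleanup;
    return 1;
}
```

    With the handler installed, a write to a broken pipe no longer kills the script silently; the write fails with EPIPE instead, so checking the return value of print becomes worthwhile too.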
Re: Process Reliablity
by AgentM (Curate) on Jul 20, 2000 at 18:17 UTC
    I like all of your answers, but I can't see how they would help him debug his program. He will need some sort of debug mechanism, not just a way of knowing when his program dies. It would certainly be more useful to know 1) why it died and 2) how to stop it from dying. For this purpose, I would recommend an error log to identify weak spots and capture Perl error messages. (It's unlikely that you'll get many useful error reports from the return value of the program.) He should also try dividing the program into "areas": regions of code that can each be said to be "doing one task". Once he has these isolated, he can use debug messaging or error logs to determine what works, what doesn't, and on which iteration of the code the program dies. In C/C++, program crashes could be linked to memory leaks 90% of the time. Since Perl handles all memory management, he'll need to isolate his weak spot using basic report programming techniques.
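    As a starting point for such an error log, here is a sketch that appends timestamped lines to a file and hooks Perl's own warnings and fatal errors into it. The sub names log_line and install_error_log are made up, and the log path is whatever you pass in; the level argument can double as the "area" label:

```perl
use strict;
use warnings;

# Append one timestamped line, tagged with our pid and a level or area name.
sub log_line {
    my ($path, $level, $msg) = @_;
    chomp $msg;
    open my $fh, '>>', $path or return 0;
    print {$fh} scalar(localtime), " [$$] $level: $msg\n";
    close $fh;
    return 1;
}

# Route Perl's warnings and fatal errors into the log as well.
sub install_error_log {
    my ($path) = @_;
    # Log the warning, then still print it to STDERR as usual.
    $SIG{__WARN__} = sub { log_line($path, 'WARN', $_[0]); warn $_[0] };
    # $^S is true inside eval, where a die is being trapped on purpose.
    $SIG{__DIE__}  = sub { log_line($path, 'DIE', $_[0]) unless $^S };
    return;
}
```

    Calling log_line at the start and end of each "area" gives a trail showing how far the program got before it died.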