Process Reliability

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Re: Process Reliability by atl (Pilgrim) on Jul 20, 2000 at 15:58 UTC
Nice, elegant suggestions, and I would recommend the pid file approach. There is a trick for making this file disappear even if the process crashes (some inode trickery, but I cannot give an example right now).

If you need a quick hack, I can offer you two low-level solutions (sort of). Assuming you are using some sort of Unix, you might want to try this:

1. Use a wrapper script to start your program. It can restart the program whenever it crashes:

```shell
#!/bin/sh
while true
do
    start_your_program
    sleep xxx    # replace xxx with a pause in seconds
done
```

Advantage: the auto-restart relieves you of checking all the time. Beware, though: this can mean trouble if your program does ugly things when restarted after a crash, and if you omit a proper sleep time, a continuous start-crash-restart cycle can put considerable load on your machine.

2. Grep through the process list to see if your program is running. This works if it has a sufficiently long and distinct name. Try this in a shell:

```shell
ps aux | grep your_program_name | grep -v grep | wc -l
```

or

```shell
ps -ef | grep your_program_name | grep -v grep | wc -l
```

depending on which Unix flavour you use. This returns the number of instances of your program running (i.e. usually 0 or 1). You can use that from a Perl script too, by putting the expression into backticks (``).

Hope that's a bit useful. Kind of old techniques, but they still work (most of the time) ;-)

Andreas
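A pid file can also be made to vanish after a crash without any inode tricks if the wrapper, rather than the program, owns the file. What follows is a minimal sketch of that different technique, not the trick alluded to above. The name `run_with_pidfile` and its arguments are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: run a command, record its pid while it is alive,
# and remove the pid file however the command ends (clean exit or crash).
run_with_pidfile() {
    pidfile=$1
    shift
    "$@" &                  # start the real program in the background
    echo $! > "$pidfile"    # $! is the pid of that background job
    wait $!                 # block until it exits or is killed
    rc=$?
    rm -f "$pidfile"
    return $rc              # pass the program's exit status on
}
```

It could be called as `run_with_pidfile /tmp/your_program.pid start_your_program`. The file is only cleaned up while the wrapper itself survives; if the wrapper is killed, a stale file is left behind, so a checker should still verify the pid.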
Re: Process Reliability by barndoor (Pilgrim) on Jul 20, 2000 at 15:03 UTC
One way I've started using (thanks Dave B for the idea) is to get your process to log its pid when it starts (into a file or database table) and then remove that entry when it closes. A little script starts every 10 minutes (either via cron or via a sleep command) and looks in the file/database for all the processes that should be running. It then looks for those pids on the system using 'kill 0' tests. If it finds a pid in the list which isn't running, it knows the process crashed. For your purposes you may want this check script to run from cron so that it will reliably start every 'n' minutes. The idea can be extended to support many jobs which you want to watch for failure. Hope this idea helps.
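The check script described above might look roughly like this in shell. It is a sketch under assumptions: the registry is taken to be a plain file with one `name pid` pair per line, and `report_dead` is a hypothetical name. Signal 0 delivers nothing; `kill` only reports whether the pid can be signalled.

```shell
#!/bin/sh
# Assumed registry format: one "name pid" pair per line.
# Prints a line for every registered pid that can no longer be signalled.
report_dead() {
    while read -r name pid; do
        [ -n "$pid" ] || continue    # skip blank or malformed lines
        # kill -0 also fails for processes owned by another user,
        # so run the check as the user that owns the jobs (or as root).
        kill -0 "$pid" 2>/dev/null || echo "$name (pid $pid) is not running"
    done < "$1"
}
```

Run from cron, the output of `report_dead /path/to/registry` could be mailed or used to trigger a restart. A pid can be reused by an unrelated process after a crash, so this test alone can report a dead job as alive.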
Re: Process reliability by young perlhopper (Scribe) on Jul 20, 2000 at 16:50 UTC
The suggestion about using a wrapper script to simply restart the program when it crashes is a good one, but I'd suggest you do two more things. First, check the return value of the program, so that if it exits normally (e.g. on a change of run level), the wrapper will allow it to do so. Second, be sure to log all the starts and restarts, or at least notify somebody about them, so that you always have a good idea of what is going on. Otherwise, you are likely to forget about it after it "just works" (I do this all the time too; it's an easy habit to get into) and never be aware of what is going on. Good luck, Mark Logan
RE: Re: Process reliability by atl (Pilgrim) on Jul 20, 2000 at 17:51 UTC
Good point! Both of them. That would make a not-so-quick but better hack. Append a timestamp and a start note / end note with exit code to a log file, to have a history. You might also send an email in case of a crash. Let's see ...

```shell
#!/bin/sh
while true
do
    start_your_program
    RC=$?    # return code
    if [ $RC -gt 0 ]
    then
        date >> /var/log/your_log_file
        echo "ABNORMAL program termination, rc = $RC" >> /var/log/your_log_file
        echo "terminated at `date`, restarting" | mail -s "Problem with zzz" root
        sleep xxx
    else
        date >> /var/log/your_log_file
        echo "Normal program termination, rc = $RC" >> /var/log/your_log_file
        sleep yyy
    fi
done
```

You can do further checking on the exit code, as Mark suggested, and act differently according to the exit code (see the case statement in your shell manual).

Andreas
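The case statement mentioned above could be sketched as follows. The mapping is illustrative only and `classify_exit` is a hypothetical name. It assumes the common shell convention that a process killed by signal N is reported as status 128+N, so 130 and 143 correspond to INT and TERM.

```shell
#!/bin/sh
# Decide what the wrapper should do for a given exit status.
# Prints "stop" or "restart"; adjust the codes to match your program.
classify_exit() {
    case $1 in
        0)       echo "stop" ;;       # clean exit: let it end
        130|143) echo "stop" ;;       # 128+2 (INT) or 128+15 (TERM): asked to quit
        *)       echo "restart" ;;    # anything else is treated as a crash
    esac
}
```

Inside the wrapper loop, a line such as `[ "$(classify_exit $RC)" = stop ] && break` would then end the loop instead of restarting.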
Re: Process Reliability by dempa (Friar) on Jul 20, 2000 at 15:04 UTC
One alternative could be to use cron, that is, if you're on a Unix system. NT has some sort of cron facility too, but since I'm not familiar with it, I won't speculate on its use. Anyway, you could modify your script so that it saves its own PID in a file. (You can get the PID from the special variable $$.) Then have a new script (executed from cron every 5 minutes or so) read that file and check whether that PID is active. If the PID is running, be sure to check the args too, to confirm it's really your script that has that PID. This would be really easy in Unix. I guess it could be done in NT too? I'll let someone else answer that...
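Checking the args as well as the PID might be done along these lines. This is a sketch that assumes a `ps` accepting the POSIX options `-p` and `-o args=`; `pid_matches` and the pid file location are hypothetical.

```shell
#!/bin/sh
# Succeeds only if the pid stored in the file is alive AND its command
# line contains the expected text, guarding against a reused pid.
pid_matches() {
    pidfile=$1
    expected=$2
    [ -r "$pidfile" ] || return 1
    pid=$(cat "$pidfile")
    case $pid in
        ''|*[!0-9]*) return 1 ;;    # empty or not a plain number
    esac
    # "args=" prints the command line with no header; ps fails if the pid is gone
    args=$(ps -p "$pid" -o args= 2>/dev/null) || return 1
    case $args in
        *"$expected"*) return 0 ;;
        *)             return 1 ;;
    esac
}
```

A cron job could then run something like `pid_matches /tmp/your_script.pid your_script.pl || restart_it`, where `restart_it` stands for whatever restarts the script.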
RE: Process Reliability by DrManhattan (Chaplain) on Jul 20, 2000 at 17:47 UTC
Try wrapping your script in something like this:

```perl
#!/usr/bin/perl
use strict;

# Loop indefinitely
while (1) {
    # Fork off a child process
    my $kidpid = fork();
    if ($kidpid) {
        # This is the parent process. Wait for
        # the child to exit
        waitpid($kidpid, 0);

        # Put some code here to send you an alert
        # when the child dies. You can also
        # check the child's exit condition here
        # with $?.
    } elsif (defined($kidpid)) {
        # This is the child process.
        # Put your original script in here
        # or just use exec()
        exit;    # make sure the child never falls back into the restart loop
    } else {
        die "could not fork: $!";
    }
}
```

That will fork off a child process to handle your script and restart it every time it dies.

-Matt
Re: Process Reliability by c-era (Curate) on Jul 20, 2000 at 15:05 UTC
You could set up a cron job that checks the program on a regular basis. You can also write a daemon that starts the program and monitors it to make sure it is always running. If the program is a server, you can also try using inetd to start your program when there is a request. If you are on WinNT, I would suggest that you make your program a service.
Re: Process Reliability by lhoward (Vicar) on Jul 20, 2000 at 17:51 UTC
How about catching the SIGCHLD signal and using that as your trigger to start the process again? Something along these lines:

```perl
#!/usr/bin/perl -w
use strict;

launch_child();
$SIG{CHLD} = \&launch_child;
sleep 60 while (1);

sub launch_child {
    print "$$ parent spawning a child\n";
    my $pid = fork;
    if (!$pid) {
        print "$$ inside the child\n";
        # all the code that does the real work is in here
        # all the other stuff is just a wrapper to keep
        # this bit going
        sleep 60 while (1);
    }
}
```

If you really care about your process, you'll back this up with some of the methods mentioned above. You may also want to add some checks to keep the process from thrashing (constantly restarting a process which immediately dies again, etc.), just in case. This technique does have the nice side effect that the restarts are nearly instantaneous.
RE: Re: Process Reliability by DrManhattan (Chaplain) on Jul 20, 2000 at 18:40 UTC
That's an interesting solution, but it's likely to dump core. From perlipc: "Do as little as you possibly can in your [signal] handler ... because on most systems, libraries are not re-entrant; particularly, memory allocation and I/O routines are not. That means that doing nearly anything in your handler could in theory trigger a memory fault and subsequent core dump." -Matt
RE: RE: Re: Process Reliability by lhoward (Vicar) on Jul 20, 2000 at 18:46 UTC
Under normal circumstances that would be the case, but since my main program doesn't do anything other than sleep, there are no non-reentrant pieces of code that could be interrupted by the dying process (causing a core dump). I have used this technique before and it has proven to be quite stable.
Re: Process Reliability by mikfire (Deacon) on Jul 20, 2000 at 17:05 UTC
Okay, I cannot resist. If you are using a system with SysV-style init, just put it in inittab with the respawn keyword. Why do the work when the system will do it for you? mikfire
RE: Re: Process Reliability by atl (Pilgrim) on Jul 20, 2000 at 17:59 UTC
Because you are in trouble if your script keeps crashing all the time. This would generate a severe load on your machine. At least, that would keep me from hooking it into the system's runlevels until I am _very_ sure this program behaves well even in insane situations. Just my two cents ;-) ... Andreas
RE: RE: Re: Process Reliability by mikfire (Deacon) on Jul 20, 2000 at 18:05 UTC
That problem already exists. If the script is dying a lot, it doesn't matter what respawn mechanism is used; there will still be a significant load placed on the system. The problem description, though, makes it sound like it only happens occasionally, which has made it hard to debug. In that case, I would rather let a proven and well-known mechanism like init do the monitoring for me than have to debug both the respawn code and the code that is dying in the first place. Second, init is usually pretty smart and stops trying a job if it is respawning too quickly. So your load increases for a bit, but init does the Right Thing and stops it from becoming a fork bomb. mikfire
Multiple-Re: Process Reliability by atl (Pilgrim) on Jul 20, 2000 at 18:31 UTC
RE: RE: Re: Process Reliability by tye (Sage) on Jul 20, 2000 at 18:17 UTC
Most versions of `init` that I've run into will notice that an entry has restarted over and over in a short time and just complain to the console that entry "xyz" is restarting too much and that it won't be restarted anymore until you change the inittab. You see this a lot with flaky terminals (people still know what a terminal is, don't they?) where `getty` keeps croaking and `init` just gives up on it. I think `init` is the perfect solution for this problem if you have access to it.
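For a wrapper script that is not run from `init`, the give-up behaviour described above can be approximated by hand. The sketch below is not how `init` implements it. The names `supervise`, `MAX_RESTARTS`, `WINDOW` and `RESTART_DELAY` are hypothetical, and `date +%s` is assumed to be available (it is on GNU and BSD systems).

```shell
#!/bin/sh
# Hypothetical limits: give up after MAX_RESTARTS failures
# counted within a window of WINDOW seconds.
MAX_RESTARTS=${MAX_RESTARTS:-5}
WINDOW=${WINDOW:-60}

# Run the given command, restarting it after each failure,
# until it exits cleanly or fails too often in too short a time.
supervise() {
    count=0
    window_start=$(date +%s)
    while :; do
        "$@" && return 0                  # clean exit: stop supervising
        now=$(date +%s)
        if [ $((now - window_start)) -gt "$WINDOW" ]; then
            count=0                       # old failures no longer count
            window_start=$now
        fi
        count=$((count + 1))
        if [ "$count" -ge "$MAX_RESTARTS" ]; then
            echo "respawning too fast, giving up: $*" >&2
            return 1
        fi
        sleep "${RESTART_DELAY:-1}"       # brief pause before the restart
    done
}
```

Called as `supervise start_your_program`, it returns non-zero once it gives up, which is a natural place to send the notification mail suggested earlier in the thread.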
Re: Process Reliability by gaggio (Friar) on Jul 20, 2000 at 17:28 UTC
What kind of program are you running for 24 hours? If it is some sort of server, you could simply implement a command "AREYOUHERE" and know that your program would have to answer "YESIAM" for it to be currently running. That way, you could even check if your program is running from a remote machine...
Re: Process Reliability by arturo (Vicar) on Jul 20, 2000 at 18:41 UTC
Well, if the problem is a broken pipe, why not install a signal handler to trap that error (in fact, why not install signal handlers for all the errors you might run into)? That way, if the program traps an error, it can execute any "cleanup code" that needs to run before wiping the pid file and shutting down. I believe the syntax for a signal handler involves manipulating the "pseudo" hash %SIG and looks something like this:

```perl
$SIG{PIPE} = sub {
    # handle error
};
```

There's also a sigtrap module; you might want to check that out.
Re: Process Reliability by AgentM (Curate) on Jul 20, 2000 at 18:17 UTC
I like all of your answers, but I can't see how they would help him debug his program. He will need some sort of debug mechanism, not just a way of knowing when his program dies. It would certainly be more useful to know 1) why it died and 2) how to stop it from dying. For this purpose, I would recommend an error log to identify weak spots and capture Perl error messages. (It's unlikely that you'll get many useful error reports from the return value of the program.) He should also try dividing the program into "areas": regions of code that can each be said to be "doing one task". Once he has these isolated, he can use debug messages or error logs to determine what works, what doesn't, and on which iteration of the code the program dies. In C/C++, program crashes could be linked to memory leaks 90% of the time. Since Perl handles all memory management, he'll need to isolate his weak spot using basic report programming techniques.