Re: Process Reliablity
by barndoor (Pilgrim) on Jul 20, 2000 at 15:03 UTC
|
One approach I've started using (thanks to Dave B for the idea) is to have your process log its pid when it starts (into a file or database table) and then remove that entry when it exits cleanly.
A little script starts every 10 minutes (either via cron or via a sleep command) and looks in the file/database for all the processes that should be running. It then looks for those pids on the system using 'kill 0' tests. If it finds a pid in the list which isn't running, it knows the process crashed.
For your purposes you may want this check script to run from cron so that it will reliably start every 'n' minutes.
The idea can be extended to support many jobs which you want to watch for failure. Hope this idea helps.
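A minimal sketch of such a check script in POSIX shell, assuming the watched processes append one pid per line to a plain file. The path in PIDFILE and the function names are made up for illustration; a database table would need its own lookup.

```shell
#!/bin/sh
# Hypothetical pid list: one pid per line, written by each watched
# process when it starts and removed again when it exits cleanly.
PIDFILE="${PIDFILE:-/tmp/watched_processes.pids}"

# is_alive PID: succeeds if PID exists and we are allowed to signal it.
# "kill -0" delivers no signal; it only runs the existence/permission check.
is_alive() {
    kill -0 "$1" 2>/dev/null
}

# report_dead FILE: print each pid listed in FILE that is not running.
report_dead() {
    [ -r "$1" ] || return 0          # no list yet, nothing to report
    while read -r pid; do
        [ -n "$pid" ] || continue    # skip blank lines
        is_alive "$pid" || echo "$pid"
    done < "$1"
}

report_dead "$PIDFILE"
```

Any pid it prints belongs to a process that went away without removing its entry. Note that a pid owned by another user also fails the 'kill 0' test, so the check is best run as the same user as the watched processes.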
Re: Process Reliablity
by atl (Pilgrim) on Jul 20, 2000 at 15:58 UTC
|
Nice, elegant suggestions, and I would recommend using
the pid file approach. There is a trick for making
this file disappear even if the process crashes
(some inode trickery, but I cannot give an example
right now).
If you need a quick hack, I can offer you two
low level solutions (sort of). Assuming you are using
some sort of unix, you might want to try this:
1. Use a wrapper script to start your program. It can
restart it whenever the program crashes:
#!/bin/sh
while true
do
start_your_program
sleep xxx
done
Advantage: the auto restart relieves you of checking
all the time.
Beware, though, that this might mean trouble if your
program does ugly things when restarted after a crash,
and it might put considerable load on your machine in
a continuous start-crash-restart cycle if you omit a
proper sleep time.
2. You might grep through the process list to see if
your program is running. This works if it has a sufficiently
long and distinct name. Try in a shell:
ps aux | grep your_program_name | grep -v grep | wc -l
or
ps -ef | grep your_program_name | grep -v grep | wc -l
depending on which unix flavour you use.
This returns the number of instances of your program
running (i.e. usually 0 or 1). You can use that from
a perl script, too, by putting the expression into
backticks (``).
Hope that's a bit useful. Kind of old techniques, but it
still works (most of the time) ;-)
Andreas
Re: Process Reliablity
by dempa (Friar) on Jul 20, 2000 at 15:04 UTC
|
One alternative could be to use cron. That is, if you're
on a Unix system. NT has some sort of cron facility too,
but since I'm not familiar with it, I won't speculate
on its use.
Anyway, you could modify your script so that it saves
its own PID in a file. (You can get the PID from the
special variable $$). Then have your new script (executed
from cron every 5 minutes or so) look in that file and
then check if that PID is active. If the PID is running,
be sure to check the args too, so it's really your script
that has that PID.
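A small sketch of that "check the args too" step in POSIX shell. The function name, the pid file path and the script name are placeholders for illustration, and it assumes a ps that accepts the POSIX -p and -o options.

```shell
#!/bin/sh
# pid_matches PID PATTERN: succeeds only if PID is running and its
# command line contains PATTERN. "ps -p PID -o args=" prints just the
# command line, without a header, for that one process.
pid_matches() {
    args=$(ps -p "$1" -o args= 2>/dev/null) || return 1
    case "$args" in
        *"$2"*) return 0 ;;
        *)      return 1 ;;
    esac
}

# Hypothetical usage from the cron job; the pid file path and the
# script name are placeholders.
# pid=$(cat /var/run/your_script.pid)
# pid_matches "$pid" your_script.pl || echo "your_script.pl is not running"
```

Matching on the command line guards against the case where the script died and the system later handed its old pid to an unrelated process.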
This would be really easy in Unix. I guess it could be
done in NT too? I'll let someone else answer that...
Re: Process reliability
by young perlhopper (Scribe) on Jul 20, 2000 at 16:50 UTC
|
The suggestion about using a wrapper script to simply
restart the program when it crashes is a good one, but I'd
suggest you do two more things. First, check the exit status
of the program, so that if it exits normally (e.g. on a change
of run level) the wrapper will allow it to do so.
Secondly, be sure to log all the starts and restarts or at
least notify somebody about them, so that you always have a
good idea of what is going on. Otherwise, you are likely to
forget about it after it "just works" (I do this all the time
too, it's an easy habit to get into) and never be aware of
what is going on.
Good luck,
Mark Logan
|
|
Good point! Both of them. That would make a not-so-quick
but better hack. Append a timestamp and a start note /
end note with exit code to a log file, to have a history.
You might also send an email in case of a crash. Let's
see ...
#!/bin/sh
while true
do
    start_your_program
    RC=$?    # return code
    if [ $RC -gt 0 ]
    then
        date >> /var/log/your_log_file
        echo "ABNORMAL program termination, rc = $RC" >> /var/log/your_log_file
        echo "terminated at `date`, restarting" | mail -s "Problem with zzz" root
        sleep xxx
    else
        date >> /var/log/your_log_file
        echo "Normal program termination, rc = $RC" >> /var/log/your_log_file
        sleep yyy
    fi
done
You can do further checking on the exit code as Mark suggested
and act differently according to the exit code (see the case
statement in your shell manual).
Andreas
RE: Process Reliablity
by DrManhattan (Chaplain) on Jul 20, 2000 at 17:47 UTC
|
Try wrapping your script in something like this:
#!/usr/bin/perl
use strict;
# Loop indefinitely
while (1)
{
    # Fork off a child process
    my $kidpid = fork();
    if ($kidpid)
    {
        # This is the parent process. Wait for
        # the child to exit
        waitpid($kidpid, 0);
        # Put some code here to send you an alert
        # when the child dies. You can also
        # check the child's exit condition here
        # with $?.
    } elsif (defined($kidpid)) {
        # This is the child process.
        # Put your original script in here
        # or just use exec()
        # The child must never fall through into the
        # loop above, or it would start forking too.
        exit 0;
    } else {
        die "could not fork: $!";
    }
}
That will fork off a child process to handle your script
and restart it every time it dies.
-Matt
Re: Process Reliablity
by c-era (Curate) on Jul 20, 2000 at 15:05 UTC
|
Re: Process Reliablity
by lhoward (Vicar) on Jul 20, 2000 at 17:51 UTC
|
How about catching the SIGCHLD signals and using that
as your trigger to start the process again? Something along
these lines:
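#!/usr/bin/perl -w
use strict;
use POSIX ":sys_wait_h";
launch_child();
$SIG{CHLD} = sub {
    # reap the dead child so it does not linger as a zombie
    1 while waitpid(-1, WNOHANG) > 0;
    launch_child();
};
sleep 60 while(1);
sub launch_child{
    print "$$ parent spawning a child\n";
    my $pid=fork;
    die "could not fork: $!" unless defined $pid;
    if(!$pid){
        $SIG{CHLD}='DEFAULT';  # the child must not respawn anything itself
        print "$$ inside the child\n";
        # all the code that does the real work is in here
        # all the other stuff is just a wrapper to keep
        # this bit going
        sleep 60 while(1);
        exit 0;  # never fall back into the parent's code
    }
}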
If you really care about your process
you'll back this up with some of the methods mentioned above.
You may also want to put some checks on to keep
the process from thrashing (constantly restarting the process,
which immediately dies again, etc...) just in case.
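One way such a thrash check might look, sketched here as a foreground restart loop in POSIX shell rather than as a SIGCHLD handler. The function name, its parameters and the RESTART_DELAY variable are made up for this sketch, and it assumes a date command that supports +%s (epoch seconds), which is common but not required by POSIX.

```shell
#!/bin/sh
# supervise MAX_FAST MIN_UPTIME CMD [ARGS...]
# Rerun CMD whenever it exits with a non-zero status. Give up once it has
# failed MAX_FAST times in a row after running for less than MIN_UPTIME
# seconds each time. RESTART_DELAY (default 1 second) is the pause between
# restarts.
supervise() {
    max_fast=$1
    min_uptime=$2
    shift 2
    fast=0
    while [ "$fast" -lt "$max_fast" ]; do
        start=$(date +%s)
        "$@"
        rc=$?
        [ "$rc" -eq 0 ] && return 0      # clean exit: stop supervising
        elapsed=$(( $(date +%s) - start ))
        if [ "$elapsed" -lt "$min_uptime" ]; then
            fast=$((fast + 1))           # another quick failure
        else
            fast=0                       # it ran long enough; reset the count
        fi
        echo "exit status $rc after ${elapsed}s, quick failures: $fast" >&2
        sleep "${RESTART_DELAY:-1}"
    done
    echo "giving up after $max_fast quick failures in a row" >&2
    return 1
}

# Hypothetical usage: stop after 5 consecutive runs shorter than 10 seconds.
# supervise 5 10 /path/to/your_program
```

A run that lasts longer than MIN_UPTIME resets the counter, so an occasional crash keeps being restarted while a start-crash-restart cycle is cut off.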
This technique does have the nice side-effect that the
restarts are nearly instantaneous.
|
|
|
|
Under normal situations that would be the case, but since
my main program doesn't do anything other than sleep there
are no non-reentrant pieces of code that could be interrupted
by the dying process (causing a core dump). I have used
this technique before and it has proven to be quite stable.
Re: Process Reliablity
by mikfire (Deacon) on Jul 20, 2000 at 17:05 UTC
|
Okay, I cannot resist. If you are using a system with SysV
style init, just put it in inittab with the respawn keyword.
Why do the work when the system will do it for you?
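For illustration, an inittab entry has the form id:runlevels:action:process. The id "yp", the runlevels and the program path below are placeholders, and details such as the allowed id length vary between init implementations, so check inittab(5) on your system.

```
# /etc/inittab
# id:runlevels:action:process
yp:2345:respawn:/usr/local/bin/your_program
```

On most SysV-style systems, "telinit q" asks init to re-read the file after you edit it.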
mikfire
|
|
|
|
That problem already exists. If the script is dying a lot,
it doesn't matter what respawn mechanism is used. There will
still be a significant load placed on the system.
The problem description, though, makes it sound like it only
happens occasionally, which has made it hard to debug. In
that case, I would rather let a proven and well-known
mechanism like init do the monitoring for me than have to
debug both the respawn code and the code that is dying in the
first place.
Second, init is usually pretty smart and stops trying
a job if it is respawning too quickly. So your load increases
for a bit, but init does the Right Thing and stops it from
becoming a fork bomb.
mikfire
|
|
|
|
Most versions of init that I've run into
will notice that an entry has restarted over and over in
a short time and just complain to the console that entry
"xyz" is restarting too much and it won't restart it anymore
until you change the inittab.
You see this a lot with flaky terminals (people still
know what a terminal is, don't they?) where getty keeps croaking and init just gives up
on it.
I think init is the perfect solution for this
problem if you have access to it.
Re: Process Reliablity
by gaggio (Friar) on Jul 20, 2000 at 17:28 UTC
|
What kind of program are you running for 24 hours?
If it is some sort of server, you could simply implement a command "AREYOUHERE" and know that your program would have to answer "YESIAM" for it to be currently running.
That way, you could even check if your program is running from a remote machine...
Re: Process Reliablity
by arturo (Vicar) on Jul 20, 2000 at 18:41 UTC
|
Well, if the problem is a broken pipe, why not install a
signal handler to trap that error (in fact, why not install
signal handlers for all the errors you might run into?). That way, if the program traps an error, it can execute
any "cleanup code" that needs to be done before wiping the pid file and shutting down.
I believe the syntax for a signal handler involves manipulating the "pseudo" hash %SIG
and looks something like this:
$SIG{PIPE} = sub { # handle the error here
};
There's also a sigtrap module; you might want to check that out.
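The same idea can be sketched in POSIX shell, as an analogue rather than a translation of the Perl: trap plays the role that %SIG plays in Perl. The pid file path and the cleanup function name are assumptions made for this example.

```shell
#!/bin/sh
# Hypothetical pid file location for this sketch.
PIDFILE="${PIDFILE:-${TMPDIR:-/tmp}/your_program.$$.pid}"
echo "$$" > "$PIDFILE"

# cleanup: whatever must happen before shutting down; here it only
# removes the pid file.
cleanup() {
    rm -f "$PIDFILE"
}

# Run cleanup whenever the shell exits, for any reason it can catch.
trap cleanup EXIT
# Turn these signals into an ordinary exit so the EXIT trap fires.
# (SIGKILL cannot be trapped, so a "kill -9" still leaves the file behind.)
trap 'exit 1' HUP INT PIPE TERM

# ... the real work of the program would go here ...
```

Because a killed or crashed process can still leave a stale pid file behind, a handler like this complements an external check rather than replacing it.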
Re: Process Reliablity
by AgentM (Curate) on Jul 20, 2000 at 18:17 UTC
|
I like all of your answers, but I can't see how they would help him debug his program. He will need some sort of debug mechanism, not just a way of knowing when his program dies. It would certainly be more useful to know 1) why it died and 2) how to stop it from dying. For this purpose, I would recommend an error log to identify weak spots and capture Perl error messages. (It's unlikely that you'll get many useful error reports from the return value of the program.) He should also try dividing the program into "areas": regions of code that can each be said to be "doing one task". Once he has these isolated, he can use debug messaging or error logs to determine what works, what doesn't, and on which iteration of the code the program dies. In C/C++, program crashes could be linked to memory leaks 90% of the time. Since Perl handles all memory management, he'll need to isolate his weak spot using basic report programming techniques.
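A minimal sketch of such an error log, written as a POSIX shell wrapper so the program itself needs no changes. The function name and both paths in the usage comment are placeholders for illustration.

```shell
#!/bin/sh
# run_logged LOGFILE CMD [ARGS...]
# Run CMD with its standard error appended to LOGFILE, bracketed by
# timestamped start and exit lines. Perl's die and warn messages go to
# standard error, so they end up in the log next to the exit status.
run_logged() {
    log=$1
    shift
    echo "$(date) starting: $*" >> "$log"
    "$@" 2>> "$log"
    rc=$?
    echo "$(date) exited with status $rc" >> "$log"
    return "$rc"
}

# Hypothetical usage; both paths are placeholders.
# run_logged /var/log/your_program.log perl /path/to/your_program.pl
```

Adding warn statements at the boundaries of each program "area" would then show in the same log which task was running when the program died.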