NetSaint is your fully formed, well-functioning wheel that you do not need to reinvent. NetSaint is a complete framework for monitoring systems and websites: it handles scheduling and notification, and comes with a very comprehensive set of tests right out of the box.
It is also completely modular, and you can write tests in Perl (I have written several).
grep
I agree about NetSaint; there is hardly any point in reinventing the wheel unless you have a reason (such as "I want to" :) ).
Anyway, if you do decide to roll your own, here are a few pointers you might find useful - been there, done that.
- Be very careful when defining what is "up" and what is "down". The script should not report that everything is down just because the server is a bit strained. Give it a second chance.
- Be very, very careful when defining what is "up" and what is "down" if this decision will make your script take any automatic action (maybe restart the application or something; on one hand, a restart might freshen everything up, on the other hand, in some designs it could cause customers to lose their session, and their carts with it!).
- Consider making a special "stats" page for your script to access, so that the script can fetch some data as well: perhaps the load on the server, or how many sessions are deemed active. XML is nice for this. Make it protected, though.
- Consider not using a special page for the script. At least make sure you just don't test "index.html". Make the script test several pages, and if possible, simulate a flow over several pages (yes, this takes some coding).
- Email is your friend. Mail yourself warnings when certain criteria are met. Maybe mail yourself when things start to look good again, too, so you know that as well.
- Email is your enemy. This pertains to the earlier points about defining what "down" really means. If you get several mails a day, and especially if you are getting false alarms, you will very soon start to ignore the mails. Do not send mail unnecessarily.
- Log as much as possible. Anything you can think of might help later.
- Log as little as possible. You don't want to sift through an Apache-access-log-sized file to get the facts you need. Make sure you can easily find those facts in the logs, via timestamps and such. Also use a special User-Agent header for your surfing script, so you can pick its requests out of the normal web logs.
- Let the script surf from someplace else, outside your firewall, preferably from some totally different location. Overseas would be great. :) Otherwise, something besides your site might be down, and you wouldn't know.
Of course, there are tons and tons more things, but these are the points I could think of right away, and I know several of them would have helped me had someone told me. :) As you can see, most of the points contradict one or several of the others. This is intentional: both sides are correct to some extent, and the idea is to find the balance. For instance, a surfing-type script that times out after 10 seconds and reports the site dead is most likely a very bad idea, but so is one that takes 10 minutes. Maybe a one-minute timeout, with a double-check, would be appropriate? Only you can answer that.
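Several of the pointers above (the one-minute timeout, the second chance, the distinctive User-Agent, and mailing only when something actually changes) can be combined into a short LWP sketch. The URLs, timings, and sub names here are invented for illustration, not a prescription:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use LWP::UserAgent;

# Distinctive agent string so the monitor's hits stand out in the normal
# web logs; one-minute timeout rather than ten seconds or ten minutes.
my $ua = LWP::UserAgent->new(
    agent   => 'site-monitor/0.1',
    timeout => 60,
);

# Fetch a page, giving the server a second chance before calling it down.
sub page_is_up {
    my ($url) = @_;
    for my $try (1 .. 2) {
        return 1 if $ua->get($url)->is_success;
        sleep 15 if $try == 1;    # brief pause, then the double-check
    }
    return 0;
}

# Mail only on state *changes* (up->down, down->up) so you are not
# flooded with repeats you will soon learn to ignore.
sub should_notify {
    my ($previous, $current) = @_;
    return $previous ne $current ? 1 : 0;
}

# Usage (hypothetical URLs and mailer):
#   my $state = page_is_up('http://www.example.com/') ? 'up' : 'down';
#   mail_admin($state) if should_notify($old_state, $state);
```

Testing a flow over several pages would just mean calling `page_is_up` on a list of URLs in order and stopping at the first failure.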
I hope these ideas gave you some hints on how to go about it. :)
Thanks for pointing me towards NetSaint. It may be just what I need to get going quickly.
Thanks for the tip. I have been searching for something off and on for a year and hadn't come across this. I will definitely look into it.
use LWP;
Make sure, though, that if your web server sits between your corporate world and customer land, you are grabbing the URL from customer land; otherwise you may be able to get your pages while customers can't, because of some problem. (ACL problems are relatively common.)
Sure, your app is running, but no one can get to it! This is just another layer of checking that can be done with relative ease.
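Fleshing out that `use LWP;` a little: a minimal external check might look like the sketch below (the URL in the usage comment is a placeholder). The point is where you run it from: a 200 fetched from inside the firewall proves little if an ACL is blocking everyone out in customer land.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use LWP::UserAgent;

# Returns 'up' or 'down' for a single URL, as seen from wherever this
# script runs -- so run it from a box on the customer side.
sub check_url {
    my ($url) = @_;
    my $ua  = LWP::UserAgent->new(timeout => 60);
    my $res = $ua->get($url);
    return $res->is_success ? 'up' : 'down';
}

# From the outside box, something like (hypothetical URL):
#   print check_url('http://www.example.com/'), "\n";
```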
Redundancy is your friend.
Especially if what you are monitoring is mission critical, you want to be able to monitor your site, but you also want something to monitor your monitoring software. What happens if it turns out the machine watching your server also went out in the same power failure?
I would say you want something like this:
- Server A is serving your pages.
- Server B is separated far enough from Server A that it's unlikely they would be affected by the same outages (or at least you would have another way of knowing if they were).
- Server B monitors Server A.
- Server A monitors Server B.
There are other points to consider. For example, how does your software run? Is it a daemon? Do you have something that will catch it when the daemon dies? Is it run by cron? What will let you know if cron dies?
The other item is monitoring vs. management. It's far better to have a report that says "Hey, your server went down and I restarted it for you; everything is OK now." than to have one that says "Hey, your server is down, your paying customers will be complaining soon; hurry up and restart it."
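To make the daemon-vs-cron point concrete, here is one sketch of the cron side: a tiny watchdog that probes the monitor daemon's pidfile and can restart it if the process is gone. The pidfile path and restart command are invented examples, not real paths.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical locations -- adjust for your own setup.
my $pidfile = '/var/run/site-monitor.pid';
my $restart = '/usr/local/bin/site-monitor --daemon';

# True if the pid recorded in $file belongs to a live process.
sub daemon_alive {
    my ($file) = @_;
    open my $fh, '<', $file or return 0;   # no pidfile => not running
    chomp(my $pid = <$fh> // '');
    return 0 unless $pid =~ /^\d+$/;
    return kill(0, $pid) ? 1 : 0;          # signal 0 only probes the pid
}

# The cron job would then manage, not just monitor:
#   unless (daemon_alive($pidfile)) {
#       system($restart);
#       # mail: "monitor was down; I restarted it, everything is OK now"
#   }
```

Run this from the *other* server (B watching A's monitor, A watching B's) and you cover the "monitor of the monitor" gap described above.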