Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Thank you all for your responses to my "style" question. Your feedback was much appreciated.

I need to monitor (functionally) whether several of my web apps are up and running. Monitoring the servers and watching for processes, I already have covered. However, for an app to be "up", all of its tiers need to be running correctly. I can perform this functional test by pulling up specific URLs in a browser, but I need to automate it to run unattended. Long term I want to buy tools that integrate with my systems management stuff, but for now, automating this test will greatly improve my monitoring capability.

I've looked around quite a bit, thinking someone would have built something like this already, but I really haven't found much. I think I could cook something up using LWP or other modules, and I was wondering whether any monks had done something similar and might point me in the right direction.
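
Roughly, I'm picturing something along these lines - a bare-bones LWP sketch with a placeholder URL, which I'd obviously still need to wrap with retries, scheduling, and notification:

    #!/usr/bin/perl -w
    # Bare-bones reachability check (sketch; placeholder URL).
    use strict;
    use LWP::UserAgent;
    use HTTP::Request;

    my $url = 'http://www.example.com/app/';
    my $ua  = LWP::UserAgent->new;
    $ua->timeout(60);

    my $res = $ua->request( HTTP::Request->new( GET => $url ) );
    print $res->is_success ? "UP" : "DOWN (" . $res->status_line . ")", "\n";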

TIA,
Rich

Replies are listed 'Best First'.
Re: Web Site Monitoring
by grep (Monsignor) on Feb 15, 2002 at 06:51 UTC
    Netsaint is your fully formed, well-functioning wheel that you do not need to reinvent. Netsaint is a complete framework for monitoring systems and websites. It handles scheduling and notification, and it contains a very comprehensive set of tests right out of the 'box'.

    It is also completely modular, and you can write tests in Perl (I have written several).
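
    To give you an idea of the size involved, here is a sketch of a Perl check in the style Netsaint expects - if memory serves, a plugin just prints one status line and exits 0 for OK, 1 for warning, 2 for critical. The URL is a placeholder:

        #!/usr/bin/perl -w
        # Minimal Netsaint-style HTTP check (sketch only).
        use strict;
        use LWP::UserAgent;
        use HTTP::Request;

        my $url = shift || 'http://www.example.com/';
        my $ua  = LWP::UserAgent->new;
        $ua->timeout(30);

        my $res = $ua->request( HTTP::Request->new( GET => $url ) );
        if ( $res->is_success ) {
            print "HTTP OK: ", $res->status_line, "\n";
            exit 0;
        }
        print "HTTP CRITICAL: ", $res->status_line, "\n";
        exit 2;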

    grep
    grep> rm -f /bin/laden
      I agree about Netsaint; there is hardly any point in reinventing the wheel if you don't have a reason (such as "I want to" :) ).

      Anyway, if you do decide to roll your own, here are a few pointers you might find useful - been there, done that.

      • Be very careful when defining what is "up" and what is "down". The script should not report that everything is down just because the server is a bit strained. Give it a second chance (the sketch after this list does exactly that).
      • Be very, very careful when defining what is "up" and what is "down" if this decision will make your script take any automatic action (maybe restart the application or something - on one hand, a restart might freshen everything up; on the other hand, in some designs it could cause customers to lose their sessions - and their carts with them!).
      • Consider making a special "stats" page for your script to access, so that it can fetch some data as well: perhaps the load on the server, or how many sessions are deemed active. XML is nice for this. Make it protected, though.
      • Consider not using a special page for the script. At the very least, make sure you don't just test "index.html". Make the script test several pages and, if possible, simulate a flow over several pages (yes, this takes some coding; there is a second sketch of this further down).
      • Email is your friend. Mail yourself warnings when certain criteria are met. Maybe mail yourself when things start to look good again, too, so you will know that as well.
      • Email is your enemy. This pertains to the first points, about defining what "down" really means. If you get several mails a day, and especially if you are getting false alarms, you will very soon start to ignore the mails. Do not send mail unnecessarily (this is why the sketch below only mails on a state change).
      • Log as much as possible. Anything you can think of might help later.
      • Log as little as possible. You don't want to sift through an Apache-access-log-sized file to get the facts you need. Make sure you can easily find the facts you need in the logs, via timestamps and such. Also, use a special User-Agent header for your surfing script, so you can spot it in the normal weblog.
      • Let the script surf from someplace else, outside your firewall, preferably from some totally different location. Overseas would be great. :) Otherwise something between your customers and your site might be down, and you wouldn't know.
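
      To make the "second chance" and "email is your enemy" points concrete, here is a sketch. Every specific in it - the URL, the address, the sendmail path, the state file - is a placeholder:

          #!/usr/bin/perl -w
          # Sketch: one-minute timeout, a second chance before declaring
          # "down", a recognisable User-Agent, and mail only on a state
          # *change* so a flapping server does not flood your mailbox.
          use strict;
          use LWP::UserAgent;
          use HTTP::Request;

          my $url       = 'http://www.example.com/app/';
          my $statefile = '/var/tmp/sitemon.state';

          my $ua = LWP::UserAgent->new;
          $ua->agent('SiteMonitor/0.1');   # easy to spot in the weblogs
          $ua->timeout(60);

          sub check_once {
              $ua->request( HTTP::Request->new( GET => $url ) )->is_success;
          }

          my $up = check_once() ? 1 : 0;
          unless ($up) {
              sleep 30;                    # second chance for a strained server
              $up = check_once() ? 1 : 0;
          }

          my $was_up = 1;                  # assume "up" on the very first run
          if ( open STATE, "< $statefile" ) {
              my $line = <STATE>;
              close STATE;
              if ( defined $line ) {
                  chomp $line;             # file holds "0" or "1"
                  $was_up = $line;
              }
          }

          if ( $up != $was_up ) {          # mail on state changes only
              open MAIL, "| /usr/sbin/sendmail -t" or die "sendmail: $!";
              print MAIL "To: you\@example.com\n",
                         "Subject: $url is ", ( $up ? "UP again" : "DOWN" ),
                         "\n\n", scalar(localtime), "\n";
              close MAIL;
          }

          open STATE, "> $statefile" or die "$statefile: $!";
          print STATE "$up\n";
          close STATE;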

      Of course, there are tons and tons more things, but these are the points I could think of right away, and I know several of them would have helped me, had someone told me. :) As you can see, most of the points contradict one or more of the others. This is intentional - both sides are correct to some extent, and the idea is to find the balance. For instance, a surfing type of script that times out after 10 seconds and reports that the site is dead is most likely a very bad idea - but so is a script that takes 10 minutes. Maybe a one-minute timeout, with a double-check, would be appropriate? Only you can answer that.
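
      And since the flow-simulation point above takes some coding, here is a rough sketch of that as well, using a cookie jar so the session survives between steps. The URLs, form fields, and expected strings are all made up - substitute your application's own:

          #!/usr/bin/perl -w
          # Sketch: walk several pages like a user would, and fail the
          # whole flow as soon as one step breaks.
          use strict;
          use LWP::UserAgent;
          use HTTP::Cookies;
          use HTTP::Request::Common qw(GET POST);

          my $ua = LWP::UserAgent->new;
          $ua->agent('SiteMonitor/0.1');
          $ua->timeout(60);
          $ua->cookie_jar( HTTP::Cookies->new );

          my @steps = (
              [ 'front page', GET('http://www.example.com/'), qr/Welcome/ ],
              [ 'login',      POST( 'http://www.example.com/login',
                                    [ user => 'monitor', pass => 'secret' ] ),
                              qr/Logged in/ ],
              [ 'cart',       GET('http://www.example.com/cart'), qr/Your cart/ ],
          );

          for my $step (@steps) {
              my ( $name, $req, $expect ) = @$step;
              my $res = $ua->request($req);
              unless ( $res->is_success and $res->content =~ $expect ) {
                  print "FLOW FAILED at '$name': ", $res->status_line, "\n";
                  exit 2;
              }
          }
          print "FLOW OK\n";
          exit 0;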

      I hope these ideas gave you some hints on how to go about it. :)

        OK, let's try this again. Thanks to all of you for your replies - lots of good pointers, along with items I had already considered. Since my replies don't seem to be staying with the right subthread, I figured I would write this one.

        • Netsaint sounds good, and may get me up and running quickly
        • I would still like to write something using LWP or HTTP::WebTest
        • I appreciate all of the tips, even the contradictory ones. I recognize that is how the real world works, and that providing these tips is a risky thing for you to do, since you don't know my apps or environment.

        To all who replied, thanks. If I end up building something, I'll post it so you can "rip it to shreds" :> -Rich

      Thanks for pointing me towards NetSaint. It may be just what I need to get going quickly.
      Thanks for the tip. I have been searching for something off and on for a year and hadn't come across this. I will definitely look into it.
Re: Web Site Monitoring
by Ryszard (Priest) on Feb 15, 2002 at 06:52 UTC
    use LWP;

    Make sure, though, that if your web server sits between your corporate world and customer land, you grab the URL from customer land; otherwise you may be able to get your pages while your customers can't, because of some problem in between (ACL problems are relatively common).

    Sure, your app is running - but no one can get to it! This is just another layer of checking, and it can be done with relative ease.
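
    If I remember right, when LWP can't reach the server at all it manufactures the error response itself and marks it with a Client-Warning header, which lets you tell "the app answered with an error" apart from "nothing answered at all" - the latter being what an ACL problem usually looks like from customer land. A sketch, with a placeholder URL:

        #!/usr/bin/perl -w
        # Sketch: distinguish "server answered with an error" from
        # "never reached the server at all".
        use strict;
        use LWP::UserAgent;
        use HTTP::Request;

        my $url = 'http://www.example.com/';
        my $ua  = LWP::UserAgent->new;
        $ua->timeout(60);

        my $res = $ua->request( HTTP::Request->new( GET => $url ) );
        if ( $res->is_success ) {
            print "reachable: ", $res->status_line, "\n";
        }
        elsif ( $res->header('Client-Warning') ) {
            # LWP generated this response itself - we never got through.
            print "unreachable (ACL/firewall?): ", $res->status_line, "\n";
        }
        else {
            print "server error: ", $res->status_line, "\n";
        }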

Re: Web Site Monitoring
by ehdonhon (Curate) on Feb 15, 2002 at 15:14 UTC

    Redundancy is your friend.

    Especially if what you are monitoring is mission critical, you want to monitor your site, but you also want something to monitor your monitoring software. What happens if it turns out that the machine watching your server went out in the same power failure?

    I would say you want something like this:

    • Server A is serving your pages.
    • Server B is separated far enough from Server A that it's unlikely they would be affected by the same outage (or at least you would have another way of knowing if they were).
    • Server B monitors server A.
    • Server A monitors server B.

    There are other points to consider. For example, how does your software run? Is it a daemon? Do you have something that will catch it when the daemon dies? Is it run by cron? What will let you know if cron dies? One cheap answer is a "dead man's switch", sketched below.
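
    The dead man's switch can be a few lines of Perl: the monitor touches a heartbeat file on every run, and a tiny cron job - ideally on another box - complains when the heartbeat goes stale. A sketch, with placeholder paths and address:

        #!/usr/bin/perl -w
        # Sketch: run this from cron on *another* machine. The monitor
        # is assumed to touch the heartbeat file on every run.
        use strict;

        my $heartbeat = '/var/tmp/sitemon.heartbeat';
        my $max_age   = 15 * 60;    # complain after 15 quiet minutes

        my $mtime = ( stat $heartbeat )[9] || 0;   # 0 if the file is missing
        my $age   = time - $mtime;

        if ( $age > $max_age ) {
            open MAIL, "| /usr/sbin/sendmail -t" or die "sendmail: $!";
            print MAIL "To: you\@example.com\n",
                       "Subject: monitor heartbeat is $age seconds old\n\n",
                       "The monitoring script itself may be dead.\n";
            close MAIL;
        }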

    The other item is monitoring vs. management. It's far better to have a report that says "Hey, your server went down and I restarted it for you; everything is ok now." than one that says "Hey, your server is down, your paying customers will be complaining soon, hurry up and restart it."
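
    Here is a sketch of that kind of management wrapper. The URL and especially the restart command are hypothetical, and remember the caveat elsewhere in this thread: automatic restarts can cost customers their sessions.

        #!/usr/bin/perl -w
        # Sketch: check, restart once on failure, and report what
        # happened either way.
        use strict;
        use LWP::UserAgent;
        use HTTP::Request;

        my $url     = 'http://www.example.com/app/';
        my $restart = '/etc/init.d/myapp restart';    # hypothetical command

        my $ua = LWP::UserAgent->new;
        $ua->timeout(60);

        sub up {
            $ua->request( HTTP::Request->new( GET => $url ) )->is_success;
        }

        if ( up() ) {
            print "ok\n";
        }
        else {
            system($restart);
            sleep 30;    # give the app a moment to come back
            if ( up() ) {
                print "server went down; restarted it, everything is ok now\n";
            }
            else {
                print "server is down and the restart did not help - hurry!\n";
            }
        }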