[OT] Monitoring a website

clinton has asked for the wisdom of the Perl Monks concerning the following question:

I'm trying to put together a list of tests to perform on the servers that run my website to ensure that I spot any problems early. Some of those below will be fed into Nagios, and others will just send alarms if there is a problem.

Can you think of any others that I should be checking? Or are some of the below useless?

Update: I forgot to add - I was thinking about building the list of tests using the Test::More etc modules, which I've previously only used for testing modules at install. Anything in particular you'd recommend for structuring these tests?

General
- disk
  - % free disk space
  - free disk space
  - smartd alerts
- memory
  - % used
  - % free
  - % buffered / cached
  - swap
  - biggest processes & their size
- load
  - processes using most load
- network
  - ping
  - have outgoing access
  - traffic in
  - traffic out
- ssh
  - access
  - number of login attempts
Apache / mod_perl
- processes
  - number of processes / max
  - total memory of all processes
  - size of biggest process
  - rate of spawning new processes
- requests
  - requests / second
  - cached responses / second
  - dynamic responses / second
  - kB transferred / second
  - 404 errors / hour
  - 500 errors / hour
- connectivity
  - website responds to test page
  - response time
MySQL
- daily check tables
- can connect
- requests / second
- open_tables / table_cache
- tmp_disk_tables / tmp_tables
- max_used_connections
- threads_connected
- threads_running
- slow_launch_threads
- slow_queries
- query cache hits / query cache requests
- key_reads / key_read_requests
- key_writes / key_write_requests
- table_locks_waited
- aborted_connects
- slave status
- slave seconds behind master
DNS
- dns accessible
- check host entries
- check mx entries
- check SPF entries
Mail
- receiving test mail
- outgoing mail being delivered
- mails in queue
- mails sent per hour
memcached
- hosts accessible
- get hits / cmd_get
- malloc_total_free / malloc_total_alloc

thanks

Clint

Comment on [OT] Monitoring a website

Replies are listed 'Best First'.
Re: [OT] Monitoring a website by jhourcle (Prior) on Oct 02, 2007 at 14:19 UTC
You might want to look at the list of tests that are available for Big Brother, Big Sister, Nagios, MRTG or other network monitoring tools, and see what's on their list that might be useful, that you haven't thought of yet. Personally, I look at two types of monitoring -- alerts for when something's having problems or going to have problems soon (eg, a disk is almost full, a webserver's taking too long to respond), and historical records, so I can try to spot trends and/or capacity planning. (eg, when I worked for a university -- is this measurement a true problem, or just part of a normal cycle (like usage spikes near end/beginning of semesters)) Although many people monitor to send alerts, I found the second to be more valuable -- You can trace back when memory/load/disk usage started going up, but before it hit alert levels, and find what changes were made shortly before that might be causing the problems. You can notice abnormal behaviour (the load goes up every Tuesday morning from 3am-9:30am? Maybe it's a cron job that needs to be moved forward so it completes before the workday starts), etc.	[reply]
Re: [OT] Monitoring a website by Corion (Patriarch) on Oct 02, 2007 at 14:07 UTC
I use some simple LWP::Simple tests to verify that my hosted websites are still working whenever I restart Apache. For the mailrouting, I wrote me a (still unreleased) module that basically queries `exim4` for the rule that applies to a target mail address: for (<DATA>) { my ($address,$expected_rule) = split / /; my @output = `exim4 -bt $mailaddress`; my @routes_as = grep /R:/, @output; is $routes_as[0], $expected_rule, "$mailaddress routes as $expecte +d_rule"; }; __DATA__ ... [download]	[reply] [d/l] [select]
Re: [OT] Monitoring a website by blue_cowdawg (Monsignor) on Oct 02, 2007 at 17:45 UTC
Having set up lots of monitoring over the years using Nagios and its predecessor Netsaint as well as HP OpenView and Sun Net Monitor I can tell you that figuring out what to monitor is always an excersise that needs to be well thought out. One thing I'd caution against is monitoring too much. Anything you run against a system is going to have some form of penalty however slight that might be. If you have a lot of slight penalties you can cause a death of a thousand scratches to what you are trying to monitor. Sort of an extreme example of Heisenburgh Uncertainty where you are affecting what you are trying to measure. How I normally select what to monitor is to first determine what is important to monitor. That whole list you have, however impressive may be, may not be all items that are important to monitor. Start with the basics. What applications am I running What will the user see if it goes down? Is the box up/down? Is it usable? Then you build from there. Having said all that... I'll just say this: K.I.S.S. Peter L. Berghold -- Unix Professional Peter -at- Berghold -dot- Net; AOL IM redcowdawg Yahoo IM: blue_cowdawg	[reply]
Re^2: [OT] Monitoring a website by clinton (Priest) on Oct 02, 2007 at 17:50 UTC
Agreed. And it may be that some of these things can be measured initially so that we can figure out what normal is, and then can be reduced to once an hour, or once a day. It reminds me of when I was working in paediatrics, and we had a premature baby who had been very sick. We'd treated him for a long time, and he had gradually recovered, but he had persistent anaemia, and we couldn't find a reason for it. Eventually, we figured out that it was because we had been monitoring him so closely - taking blood every day. We stopped checking, and he recovered nicely. thanks for the advice Clint	[reply]
Re: [OT] Monitoring a website by CountZero (Bishop) on Oct 02, 2007 at 15:05 UTC
I monitor my home network and servers through NAGIOS and now that there have been a few recent updates to Nagios::Plugin I will probably extend the monitoring through home-grown Nagios-Perl plugins. The combination of Nagios and Perl seems to open a wide expanse of monitoring possibilities. CountZero A program should be light and agile, its subroutines connected like a string of pearls. The spirit and intent of the program should be retained throughout. There should be neither too little or too much, neither needless loops nor useless variables, neither lack of structure nor overwhelming rigidity." - The Tao of Programming, 4.1 - Geoffrey James	[reply]
Re: [OT] Monitoring a website by talexb (Chancellor) on Oct 02, 2007 at 20:01 UTC
We use Nagios, and we monitor the following server parameters: inodes and space available on each drive swap usage and load nominal, ssh available some custom checks to make sure the web application is up and running (pid and http checks) a check that PostgreSQL is alive That's just an application server -- the other things you list are numbers that might be interesting, but probably won't signal that the server is going down shortly. Rather, they are stats that I'd probably look at once a week, but I wouldn't want to set any alarms for them. Alex / talexb / Toronto "Groklaw is the open-source mentality applied to legal research" ~ Linus Torvalds	[reply]
Re: [OT] Monitoring a website by swngnmonk (Pilgrim) on Oct 02, 2007 at 19:34 UTC
As others have already mentioned, this is a realm that requires a lot of planning and thought. And there are plenty of off-the-shelf packages available out there. With that, I'd throw my two cents in for mon: http://www.kernel.org/software/mon/ It's written almost entirely in perl, it's extremely extensible, and it's fantastic for application-level monitoring, which I feel like a lot of the network-monitoring applications aren't well-suited to. We use it to monitor and support an extremely complex server infrastructure that has a lot of dependencies and moving parts - mon has done a fantastic job for us.	[reply]
Re: [OT] Monitoring a website by DutchCoder (Scribe) on Oct 02, 2007 at 20:18 UTC
Hi, At work (50+ webservers) we use most of these tests every five minutes (errors activate a text message service) and we build graphs of 8 functions. You might want to add "Total Load" and "CPU busy". Don't for get to have an external hosting account checking if the server is reachable (every five minutes). (Error activates a text message service)	[reply]
Re: [OT] Monitoring a website by misc (Friar) on Oct 03, 2007 at 09:11 UTC
I'm also checking the temperatures on my servers. (harddisk, cpu, mainboard,..) Assuming you run linux, you should be able to grep the temperatures in the files at /proc/acpi/thermal_zone/* There should also be some modules at cpan. I experienced temperature problems from time to time, which leaded to strange problems (segfaults, the big bytecrunsher,...) The last time I met the bytecrunsher, there was to much dust on the cpu's cooler..	[reply]
Re: [OT] Monitoring a website by SFLEX (Chaplain) on Oct 03, 2007 at 10:10 UTC
requests * requests / Size For a web page check the size of the param, CGI.pm can do this for you and give an error "cgi_error()". But if you can check a larg request and stop it before Perl touches it, i guess it would be a lot safer.	[reply]