Where I work, we've recently had a big push to improve and modernize our approach to systems monitoring. I thought I'd take a little time to share some of the approaches we've come up with, and how they're benefitting us.

In most medium-to-large production environments, you generally find one or more systems that run regularly scheduled jobs. Cron does a nice job of this, but suffers from one fatal flaw. It's not human. :) Should one of these regularly scheduled jobs kick off and suddenly die, or fail to run at all, it's up to either you or cron to funnel stdout/stderr to someone, or up to the script itself to generate some sort of notification when it can't run successfully. But what happens when things stop working, and nobody notices for days, weeks, or months? We've actually developed a third way, called Monolith, to detect when this state happens.

Suppose you gave each of your regularly scheduled jobs the ability to call home to a centralized database. After a while, maybe after 3 or 4 executions, a bit of a track record would begin to develop: one that could tell you when the next invocation of that command can be expected to show up.
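To make the "track record" idea concrete, here is a minimal sketch. The helper name expected_interval and the sample timestamps are made up for illustration, and averaging the gaps between check-ins is an assumption; the post doesn't say how Monolith actually derives the interval.

```perl
use strict;
use warnings;

# Hypothetical helper: estimate an entity's check-in interval from its past
# check-in times (epoch seconds, oldest first). Averaging the gaps is an
# assumption, not necessarily what Monolith does.
sub expected_interval {
    my @times = @_;
    return if @times < 2;    # no track record yet

    my $total = 0;
    $total += $times[$_] - $times[ $_ - 1 ] for 1 .. $#times;
    return $total / $#times; # mean gap between consecutive check-ins
}

# Made-up sample: four check-ins, 200 seconds apart.
my @checkins = (1000, 1200, 1400, 1600);
my $interval = expected_interval(@checkins);
printf "usual interval: %d s, next check-in expected around: %d\n",
    $interval, $checkins[-1] + $interval;
```

With only one check-in on record the helper returns nothing, which matches the idea that a prediction needs a few executions first.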

Enter Monolith. Monolith is a two-part tool. The first part, a simple call-home script, takes just one argument -- an "entity name", which is usually the name of the script itself. When run, it makes a connection to a MySQL database and adds a row to a table saying, "Hi! I'm {entityName} on {host}, and it's currently {time} where I am". The other part of the tool is a script that watches this database, looking for instances when an entity has stopped calling home. In other words, if you know that entity "foobar.pl" usually checks in every 200 seconds, and the last time it checked in was more than 200 seconds ago, you know that there's a problem with foobar.pl, and an alert can be generated to that effect. Incidentally, we set our detection threshold at 20% over the usual interval. Meaning, if something that is known to check in every 100 seconds hasn't checked in for the past 120 seconds, an alert is generated.
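The watcher's core test can be sketched roughly like this. The function name is_overdue and its parameters are hypothetical, and the real watcher script isn't shown in this post; the only thing taken from the description above is the 20% threshold.

```perl
use strict;
use warnings;

# Hypothetical check: has an entity been silent for its usual interval plus
# the tolerance? Times are epoch seconds; the tolerance is a whole percent
# and defaults to the 20% mentioned in the post.
sub is_overdue {
    my ($last_seen, $interval, $now, $tolerance_pct) = @_;
    $tolerance_pct = 20 unless defined $tolerance_pct;

    my $deadline = $last_seen + $interval * (100 + $tolerance_pct) / 100;
    return $now >= $deadline ? 1 : 0;
}

# An entity that normally checks in every 100 seconds.
my $now = time;
for my $silence (110, 120, 300) {
    my $state = is_overdue($now - $silence, 100, $now) ? 'ALERT' : 'ok';
    print "silent for ${silence}s: $state\n";
}
```

In a real watcher, $last_seen would come from the entityLastSeen column and the interval from the deltas recorded in the Events table.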

Here's the call-home script:

    #!/usr/bin/perl
    ##
    ## Monolith written 082213 by Bowie J. Poag
    ##
    ## Monolith is a mechanism that allows regularly-scheduled scripts to be
    ## monitored remotely. For every entity (script, command, whatever) you
    ## want monitored, call this command.
    ##
    ## Usage: ./monolith.pl <scriptname>
    ##
    ## Example: /usr/local/bin/monolith.pl tripwire
    ##
    ## English: Make an entry in the Monolith database saying that Tripwire just ran.
    ##

    use Mysql;

    $monolithDBHandle=Mysql->connect('tmcpmonitordb','Monolith','xxxxx','xxxxxxxxxxx');

    unless ($monolithDBHandle) ## No handle means no DB, so there's no point in carrying on.
    {
        print "Monolith: Unable to connect to DB.\n";
        exit 1;
    }

    $timeStamp=time;
    $hostName=`hostname`;
    chomp($hostName);
    $entityName=$ARGV[0];
    $x=0; ## Counts the rows found for this entity on this host.

    $checkExistQuery=$monolithDBHandle->query("SELECT * FROM Entities WHERE entityName='$entityName' AND entityHostName='$hostName';");

    while (@checkExist=$checkExistQuery->fetchrow_array)
    {
        $lastSeen=$checkExist[3];
        $x++;
    }

    if ($x==0) ## No rows returned. Hmm. This means we're checking in for the first time, so, let's create a new entry for ourselves in the DB.
    {
        $updateMonolith=$monolithDBHandle->query("INSERT INTO Entities (entityName,entityHostName,entityLastSeen,entityFrozen) VALUES ('$entityName','$hostName','$timeStamp','0');");
    }
    else
    {
        $updateMonolith=$monolithDBHandle->query("UPDATE Entities SET entityLastSeen='$timeStamp' WHERE entityName='$entityName' AND entityHostName='$hostName';");
    }

    ## Note: on a first check-in $lastSeen is unset, so this delta is just the raw timestamp.
    $entityDelta=$timeStamp-$lastSeen;

    $updateMonolith=$monolithDBHandle->query("INSERT INTO Events (timeStamp,hostName,reportingEntity,reportingDelta) VALUES ('$timeStamp','$hostName','$entityName','$entityDelta');");

We have taken this idea, the ability to predict when something should have called home, but hasn't, and greatly expanded upon it. Monolith is now a status dashboard that gives near-realtime status on over 200 different entities running across about 30 different hosts. To begin monitoring anything, all it takes is adding a single line to the script you want monitored, and you're done. A more clever use would be to only call home to Monolith if the script was successful; that way, if the script ran but failed operationally for some reason, that can be detected and resolved. Anything which runs at regular intervals, and whose state can be conveyed in terms of on/off, successful/not successful, or present/not present, can be visualized.
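The "only call home on success" variant might look something like the sketch below. The wrapper name run_and_report and the job path in the example are hypothetical; the only thing taken from the post is that monolith.pl accepts a single entity name as its argument.

```perl
use strict;
use warnings;

# Hypothetical wrapper: run a job, then call home only if it succeeded.
# $report_cmd is an array ref holding the call-home command; the entity name
# is appended as its single argument, as monolith.pl expects.
sub run_and_report {
    my ($entity, $report_cmd, @job) = @_;

    # system() returns 0 only when the job ran and exited with status 0.
    return 0 if system(@job) != 0;

    # The job succeeded, so check in. A failed check-in only gets a warning.
    system(@$report_cmd, $entity) == 0
        or warn "Monolith check-in failed for $entity\n";
    return 1;
}

# Example call (the backup script path is made up):
# run_and_report('nightly-backup', ['/usr/local/bin/monolith.pl'],
#                '/usr/local/bin/nightly-backup.sh');
```

A job that runs but fails never checks in, so the watcher flags it just as it would a job that never ran at all.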

Here's what our front-end to Monolith looks like, in-house:

http://i.imgur.com/kf7nYA6.png

Our organization now has 200+ more pairs of automated eyes carefully ensuring that everything we have is working as expected, and alerting us when it's not. It has already borne substantial fruit. In instances where something systemic had broken, it affected the ability of several scripts on several different hosts to run. It helped greatly to have a visual map of what was broken, so that we could be 100% confident that we had fixed the problem in every place.

tl;dr - We have a tool that tracks regularly scheduled tasks to ensure they're calling home at regular intervals. When they deviate from the expected drum pattern they've created for themselves over time, or stop phoning home altogether, we know about it immediately, versus being caught off-guard and finding out at some point down the road.

Cheers,

Bowie


In reply to Monolith: A Clever Tool For Monitoring Regularly Scheduled Tasks by bpoag
