Well, I've found that no matter how much planning you do, there's always something left to go wrong. I also don't know what you're using for your monitoring, but some packages have a way to keep them from sending out alerts, so that when you know something's down, it won't keep paging you. (A couple of the folks I used to work with were contemplating building something that would only let you suppress the warnings on a timer, and would then escalate if the problem wasn't resolved in time... after we found out someone had disabled the warnings on their systems because they didn't like getting paged 'all the time'.)
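That timed-suppression idea only takes a few lines. Here's a rough sketch in Perl (the silence-file path and the sub names are made up; a real check loop would call should_page() on every cycle):

    use strict;
    use warnings;

    my $silence_file = '/var/run/monitor.silence';

    # Suppress paging for $minutes; once the timer runs out,
    # should_page() starts returning true again (i.e. it escalates).
    sub silence_alerts {
        my ($minutes) = @_;
        open my $fh, '>', $silence_file
            or die "can't write $silence_file: $!";
        print $fh time() + $minutes * 60, "\n";
        close $fh;
    }

    # True if we should page: no silence set, or the silence expired.
    sub should_page {
        return 1 unless -e $silence_file;
        open my $fh, '<', $silence_file or return 1;
        chomp(my $expires = <$fh> // 0);
        close $fh;
        if (time() > $expires) {
            unlink $silence_file;  # window's over: page, don't stay quiet
            return 1;
        }
        return 0;                  # still inside the suppression window
    }

The point of the expiry check is that nobody can silence a box and walk away; the worst they can do is buy themselves a fixed window.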
Depending on how the paging goes out, if it's through a mail gateway you might also be able to shut down the local mail queues, provided it doesn't send straight to another SMTP server. (Fortunately, I have the luxury of being able to sit and think about these things, as opposed to when you've been up all night, you're hitting the end of your planned outage window, and all you want to do is get out of there and crash for the night.)
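I don't know what MTA your gateway runs, but if it happens to be Postfix (just one way it could go), you don't even have to shut the queues down; you can put everything on hold for the window and release it afterwards:

    use strict;
    use warnings;

    # Move every queued message to the hold queue so pages routed
    # through this gateway sit there instead of going out mid-outage.
    system('postsuper', '-h', 'ALL') == 0
        or die "couldn't hold the queue: $?";

    # ... do the maintenance ...

    # Release everything once the work is done.
    system('postsuper', '-H', 'ALL') == 0
        or die "couldn't release the queue: $?";

The nice part of holding rather than stopping the MTA is that the pages aren't lost, just delayed until you let them go.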
Anyway, I find you learn best from doing things. (And I learn the most from making mistakes, since I don't want to repeat them.) No matter how much you read, take classes, or plan for things, it's nothing like the real thing, when you're pulling 16-hour days for two weeks straight, trying to recover a 30k-user mail store, watching the three Fibre Channel controllers you connect to mysteriously fail one after another. Or when you power down your entire 100+ server data center so they could install a UPS bypass switch (yet we had to shut down again the next year for the batteries to be serviced), only to have your terminal server not come back up, so you're rolling two Wyse terminals around on carts as you bring the machines up two at a time.
In reply to Re^5: Reinventing the wheel by jhourcle in thread Reinventing the wheel by bageler