in reply to methods of recovering from ram issues
Sorry, to answer your question I'd need some more information.
There may be several points of failure: CPU, RAM, IO, network connections, technical problems (old tapes, disk problems)...
Some thoughts: Your monitoring script must never be the reason your HA system becomes less available, so it shouldn't consume too much of the system's resources.
If your script uses a lot of RAM, you should save its results to disk. Storable (use Storable) is one possibility; a database such as SQLite via DBI (use DBI) is another. If you already have a database running on the system, that would be a third option.
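To make the Storable idea concrete, here is a minimal sketch of checkpointing results to disk so that a restarted script can pick up where it left off. The file name, the `lines_seen` key and the two helper subs are made up for illustration; I'm also assuming that writing to a temporary file and renaming it is good enough protection against a half-written checkpoint on your filesystem.

```perl
use strict;
use warnings;
use Storable qw(nstore retrieve);
use File::Temp qw(tempdir);
use File::Spec;

# Hypothetical checkpoint location; a real script would use a fixed path.
my $dir  = tempdir(CLEANUP => 1);
my $file = File::Spec->catfile($dir, 'results.sto');

# Write the results to a temporary file first, then rename it into place,
# so a crash during the write never leaves a truncated checkpoint behind.
sub save_checkpoint {
    my ($results, $path) = @_;
    my $tmp = "$path.tmp";
    nstore($results, $tmp) or die "cannot store to $tmp: $!";
    rename $tmp, $path or die "cannot rename $tmp to $path: $!";
    return;
}

# Load the last checkpoint, or start with an empty hash if there is none.
sub load_checkpoint {
    my ($path) = @_;
    return -e $path ? retrieve($path) : {};
}

my $results = load_checkpoint($file);   # first run: empty hash
$results->{lines_seen} += 1000;         # pretend we processed some input
save_checkpoint($results, $file);

my $restored = load_checkpoint($file);  # what a restarted script would see
print "lines_seen after restart: $restored->{lines_seen}\n";
```

How often you checkpoint is a trade-off: every write costs IO, so on a box that is already IO-bound you would save less often and accept losing more work after a crash.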
If there's a tremendous amount of IO on the system and you have no problems with RAM, just keep the results in memory. Disadvantages: no recovery is possible after a crash, and the output works something like a printer: whatever has been written is all you get.
If you have (a) both a lot of IO and little RAM, or (b) no real idea of what is happening on your system, you will have to experiment. On Unices you can get a first impression with programs like iostat or top.
If you are still not sure about your system, just remember a few ideas. Don't collect too much information! Pick one or two topics and analyze those. Why? If your script collects most or all of the interesting information, it will itself be touching the 'single point of error', and you may end up with an ocean of information in which you drown.
If you have some batch processes running you may have to distinguish between "daytime operations" and "nightly operations".
That "don't collect too much information" also refers to a possible problem with CPU cycles. Your script shouldn't push the CPU workload into a state worse than "acceptable".
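One way to keep a collector's CPU use in check is to measure how much CPU time each sampling step costs and then sleep long enough that the script stays under a fixed share of one CPU. This is only a sketch: the 10% budget and the sub names `pause_for` and `throttled` are my own assumptions, and it only limits the script's own CPU time, not the IO it causes.

```perl
use strict;
use warnings;
use Time::HiRes qw(sleep);

# Hypothetical budget: the script may use at most 10% of one CPU.
my $MAX_CPU_SHARE = 0.10;

# How long to sleep after burning $cpu_used seconds of CPU so that
# cpu / (cpu + sleep) does not exceed $max_share.
sub pause_for {
    my ($cpu_used, $max_share) = @_;
    return $cpu_used * (1 - $max_share) / $max_share;
}

# Run one unit of work, then sleep off the CPU time it consumed.
# Returns the pause that was applied, in seconds.
sub throttled {
    my ($work, $max_share) = @_;
    my ($user0, $sys0) = times;      # CPU seconds used so far
    $work->();
    my ($user1, $sys1) = times;
    my $cpu   = ($user1 - $user0) + ($sys1 - $sys0);
    my $pause = pause_for($cpu, $max_share);
    sleep $pause if $pause > 0;
    return $pause;
}

# Usage: wrap each sampling step; the loop body stands in for real work.
my $pause = throttled(sub { my $x = 0; $x += $_ for 1 .. 100_000 }, $MAX_CPU_SHARE);
printf "slept %.3f s to stay within budget\n", $pause;
```

On a system with distinct daytime and nightly loads you could use a different budget for each period.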
Don't assume that there is exactly one problem. I have a database server. During backups the bottleneck is the network, while the backup files are being sent to the backup server with the tape drives. At night the CPU is the bottleneck, when a certain calculation-heavy job is running. During the day the server has congested IO at 7 A.M. for a few minutes, because that's when we receive a bunch of data from partners. There are also occasional daytime problems when users block each other's database access. Monitoring the system is a continuing process, as the numbers keep rising all the time. The server they used six years ago couldn't handle today's workload...