bigmoose has asked for the wisdom of the Perl Monks concerning the following question:

hello monks!

I came into programming a few months ago, having been a support engineer in an HA (high availability) environment. That meant having to look for 'single points of failure' in order to minimise downtime and disruption as much as possible.

I'm currently writing a script that stores data in memory via hashes for a period of time while the script finishes.

In my eyes that's a point of failure, isn't it? If the application crashes for some reason, due to a power cut or perhaps even faulty memory, how can I recover from that point?

tl;dr: are there any programming concepts I should read about concerning recovery from situations where data temporarily stored in RAM is lost due to environmental issues?

Replies are listed 'Best First'.
Re: methods of recovering from ram issues
by BrowserUk (Patriarch) on Dec 20, 2011 at 12:23 UTC
    involves storing data in memory via hashes for a period of time ... if the application crashes for some reason.. due to power or perhaps even faulty memory.. how can i recover ...

    What is the source of that data? How long does it take to (re-)build the hashes?

    how can i recover from that point

    Do you need to recover from that point?

    I've seen people go to extraordinary lengths to try and checkpoint the state of an application at regular intervals during its runtime with a view to allowing it to pick-up from where it left off in the event of failure. Where in many cases, it is easier, cheaper and more reliable to simply run the process again from scratch.

    Not always of course, but surprisingly frequently the economics of building ever more elaborate process monitoring, check-pointing, on-the-fly replication, load-balancing, redundancy and fail-overs into a system simply do not stand up to scrutiny. Each new layer of defensive mechanisms adds both cost and complexity to the system, and complexity is the absolute antithesis of reliability. And that growth in complexity (and therefore cost) is not linear, but rather exponential as the 'need' to: monitor the monitor; backup the backup; have redundancy for the redundant; becomes institutionally imperative.

    And in the end, it's never the thing you thought might fail, that does. I still have memories of many very long hours freezing my fingers off monitoring a data-scope before discovering that the lift-motor in the unit next door, the other side of a concrete wall a couple of feet thick, would produce copious amounts of broad spectrum RF interference whenever they took a delivery of peanuts and cashews. (They are very dense products, which made it easy to overload their lift.) At that point, all network communications between the multiply redundant fail-over servers ceased, their heart-beat checks failed, and they all tried to step in to take over from each other. Result: when the RFI ceased, all the servers were trying to do all the jobs and everything got corrupted.

    The best advice I can give is: make each process as simple as possible and have it be driven by the arrival of its data; have it process input data in discrete chunks; never discard your source data.
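
    To make that last point concrete, here is a minimal sketch of chunk-driven processing (the file names and the per-record "work" are invented for illustration); the source file is never touched, each finished chunk is written out immediately, and a re-run simply picks up where the output left off:

        #!/usr/bin/perl
        use strict;
        use warnings;
        use IO::Handle;

        # Hypothetical file names; the source file is never modified.
        my $source = 'incoming.dat';
        my $done   = 'processed.out';

        # Count how many records have already been written, so a re-run
        # simply skips the chunks that are already safely on disk.
        my $already = 0;
        if (open my $chk, '<', $done) {
            $already++ while <$chk>;
            close $chk;
        }

        open my $in,  '<',  $source or die "Cannot read $source: $!";
        open my $out, '>>', $done   or die "Cannot append to $done: $!";
        $out->autoflush(1);

        while (my $record = <$in>) {
            next if $. <= $already;     # skip work finished in a previous run
            chomp $record;
            my $result = uc $record;    # stand-in for the real per-chunk processing
            print {$out} "$result\n";   # persist each chunk as soon as it is done
        }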


    With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.

    The start of some sanity?

      This, with or without the cashews, is definitely a very important point.   It is definitely possible to “over think” a recovery scenario to the point that you actually create a system that is quite incapable of practical recovery from anything at all.   Usually, the problem is that, not only do you not know where you were when the failure occurred (not to mention that “failures do not all happen neatly at the same time...”), but you have no idea where the “last known good point” is, either.   Therefore, perhaps the best overall strategy might be one based on what IBM used to call, “checkpoint/restart.”   The system moves from one known-good state (“checkpoint”) to the next, and it has the well-defined ability, first, to get itself back to that checkpoint, and then, to continue (“restart”) from that place (and, perhaps, no other).

      “Reliance upon RAM,” stated in a slightly different way, might well be considered as “reliance upon guaranteed-writes,” which of course might not be guaranteed at all.   To put it yet another way, it boils down to uncertainty about the system-state as currently fixed to disk vs. the perfectly dynamic state currently not-so fixed in RAM.   Information that is fixed only in RAM obviously cannot be recovered if power is lost, therefore the true concern is the data-integrity of whatever is presently available on disk.
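
      For illustration, a bare-bones checkpoint/restart sketch in Perl might look like the following, using the core Storable module (the checkpoint file name and the work loop are invented; the essential trick is writing to a temporary file and renaming, so the previous known-good checkpoint is never corrupted mid-write):

          #!/usr/bin/perl
          use strict;
          use warnings;
          use Storable qw(store retrieve);

          my $checkpoint = 'state.chk';    # hypothetical checkpoint file

          # Restart: pick up the last known-good state if one exists.
          my $state = -e $checkpoint
              ? retrieve($checkpoint)
              : { next_item => 0, results => {} };

          for my $i ($state->{next_item} .. 99) {
              $state->{results}{$i} = $i * $i;    # stand-in for the real work
              $state->{next_item}   = $i + 1;

              # Checkpoint: write the whole state to a temporary file,
              # then rename, so a crash mid-write never destroys the
              # previous known-good checkpoint.
              store($state, "$checkpoint.tmp");
              rename "$checkpoint.tmp", $checkpoint
                  or die "Cannot replace checkpoint: $!";
          }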

      Excellent post.

Re: methods of recovering from ram issues
by ww (Archbishop) on Dec 20, 2011 at 11:55 UTC
    Perhaps reading about "cost:benefit analysis" would be a good place to start.

    Before I'd consider complicating a script that's supposed to be doing "real" work with the seatbelts, warning labels, failsafes, et al to deal with even the two points of failure you've mentioned, I'd want to know how likely those are and what it will cost me if they occur without additional protection.

    For example, re power failure: will the code (and storage and CPU cycles) to protect against loss of power, spike, surge or whatever cost more than
        a) ...the cost of restarting the process from whatever was the state of affairs prior to the hypothetical failure? (Yes, this improperly ignores the costs incurred by whatever consequences of the outage develop because of or during the outage.)
        b) ...a UPS and backup generator?

    If the s/w is doing reactor control, yeah, we'd probably better have failsafe s/w... but if the s/w is adjusting the "lights on" time in a hobby-farm chicken coop, maybe it's not worth the programmer time to write the failsafes.

    Bottom line: I know this doesn't answer the question you asked... but -- by way of apology -- hope I've offered some thoughts worth considering before you go charging off to build the next 'unsinkable' Titanic or Fukushima Dai-Ichi.

Re: methods of recovering from ram issues
by Utilitarian (Vicar) on Dec 20, 2011 at 11:23 UTC
    Do you need to maintain a queue of pending operations and only remove actions from the queue when they are completed?
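
    If it's the first, a crude file-based sketch might look like this (the directory layout is invented; the point is that an action only leaves the queue after it has actually been done):

        #!/usr/bin/perl
        use strict;
        use warnings;
        use File::Copy qw(move);

        # Hypothetical layout: one small file per pending action.
        my $pending = 'queue/pending';
        my $done    = 'queue/done';

        opendir my $dh, $pending or die "Cannot open $pending: $!";
        my @jobs = sort grep { -f "$pending/$_" } readdir $dh;
        closedir $dh;

        for my $job (@jobs) {
            process_job("$pending/$job");           # do the real work first ...
            move("$pending/$job", "$done/$job")     # ... only then take it off the queue
                or die "Cannot archive $job: $!";
        }

        sub process_job {
            my ($path) = @_;
            # stand-in for whatever operation the queued file describes
        }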

    Do you need to create a Storable hash which can be recovered in the event of a failure so that you can continue processing ...

    The approach you take to protecting your data depends on the nature of the problem.

    print "Good ",qw(night morning afternoon evening)[(localtime)[2]/6]," fellow monks."
Re: methods of recovering from ram issues
by leuchuk (Novice) on Dec 20, 2011 at 12:58 UTC

    Sorry, to answer your question I'd need some more information.

    There may be several points of failure: CPU, RAM, IO, network connections, technical problems (old tapes, disk problems)...

    Some thoughts: your script shouldn't be the reason for reducing the availability of your HA system, so it shouldn't consume too many of its resources.

    If your script uses a lot of RAM you should save results to disk. use Storable is one possibility; a database like SQLite via use DBI could be another. If you already have a database running on the system, that would be another option.
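
    For the DBI route, a minimal sketch (table and file names are invented; DBD::SQLite has to be installed):

        use strict;
        use warnings;
        use DBI;

        # An on-disk SQLite database survives a crash of the script.
        my $dbh = DBI->connect('dbi:SQLite:dbname=results.db', '', '',
                               { RaiseError => 1, AutoCommit => 1 });

        $dbh->do('CREATE TABLE IF NOT EXISTS results (k TEXT PRIMARY KEY, v TEXT)');

        # Write each hash entry through to disk as it is produced.
        my $ins = $dbh->prepare('INSERT OR REPLACE INTO results (k, v) VALUES (?, ?)');
        $ins->execute('host01', 'ok');

        # After a restart, read everything back.
        my $rows = $dbh->selectall_arrayref('SELECT k, v FROM results');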

    If there's a tremendous amount of IO on the system and you have no problems with RAM, just keep the data there. Disadvantages: no recovery is possible, and the output is something like a printer.

    If you have (a) both a lot of IO and little RAM, or (b) no real idea what happens on your system, you have to experiment. On Unices you can get a first impression with programs like iostat or top.

    If you are still not sure about your system, keep a few ideas in mind. Don't collect too much information! Pick one or two topics and analyse them. Why? If your script collects most or all of the interesting information, it will itself become the 'single point of failure', and you may end up with an ocean of information in which you drown.

    If you have some batch processes running you may have to distinguish between "daytime operations" and "nightly operations".

    That "don't collect too much information" refers to a possible problem with CPU cycles. Your script shouldn't change the CPU work load to a status different from "acceptable" (or better).

    Don't assume that there is exactly one problem. I have a database server. During backups the bottleneck is the network, while the backup files are sent to the backup server with the tape drives. At night the CPU is the problem, when a certain job that does a lot of calculation is running. During the day the server has congested IO at 7 A.M. for a few minutes, because that's when we get a bunch of data from partners; at other times there are problems when users block each other's database access. Monitoring the system is a continuing process, as the numbers keep rising. The server they used six years ago couldn't handle today's workload...

Re: methods of recovering from ram issues
by zwon (Abbot) on Dec 20, 2011 at 13:10 UTC

    Maybe you should switch to some external data storage like Redis.
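
    For example, with the Redis module from CPAN (the server address and key name are just placeholders):

        use strict;
        use warnings;
        use Redis;

        my $redis = Redis->new(server => '127.0.0.1:6379');

        # Mirror the in-memory hash into a Redis hash as you go.
        $redis->hset('job:state', host01 => 'ok');

        # After a restart, pull the whole thing back into a Perl hash.
        my %state = $redis->hgetall('job:state');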

Re: methods of recovering from ram issues
by Sewi (Friar) on Dec 20, 2011 at 15:19 UTC

    Hi bigmoose,

    I second the other posts above: carefully check whether you really need to restart the process from the point it was at. If you checkpoint and restore a Perl script (I think I saw a module for this years ago, but can't find it now), not everything gets restored, starting with your database connections.

    If all you need is to secure a few hashes, writing a tie() module might be an option.

    You could store your data in a local hash within the tied module and copy every write to some persistent data source (a database, hard disk, another server, ...).
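
    A rough sketch of such a tie() class, mirroring every write to disk with the core Storable module (the class name and file name are invented, and only STORE is mirrored here; DELETE etc. would need the same treatment in real use):

        package PersistentHash;
        use strict;
        use warnings;
        use Tie::Hash;                      # provides Tie::StdHash
        use Storable qw(store retrieve);
        our @ISA = ('Tie::StdHash');

        my $FILE = 'mirror.stor';           # hypothetical on-disk copy

        sub TIEHASH {
            my ($class) = @_;
            # Start from the on-disk copy if one survives a previous run.
            my $self = -e $FILE ? retrieve($FILE) : {};
            return bless $self, $class;
        }

        sub STORE {
            my ($self, $key, $value) = @_;
            $self->{$key} = $value;
            store({ %$self }, $FILE);       # push every write through to disk
        }

        package main;
        tie my %data, 'PersistentHash';
        $data{answer} = 42;                 # transparently mirrored to mirror.stor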

    But again... is it worth the time?

Re: methods of recovering from ram issues
by TJPride (Pilgrim) on Dec 20, 2011 at 16:31 UTC
    The whole point of using RAM in the first place is to minimize disk I/O. It seems rather self-defeating to back things up on disk as well, unless your process takes an extremely long time to run and recreating the hash is also slow and/or impossible. This is one of those situations where we need more specifics about what your application does and where it's running to give a good answer.