in reply to methods of recovering from ram issues

involves storing data in memory via hashes for a period of time ... if the application crashes for some reason.. due to power or perhaps even faulty memory.. how can i recover ...

What is the source of that data? How long does it take to (re-)build the hashes?

how can i recover from that point

Do you need to recover from that point?

I've seen people go to extraordinary lengths to try and checkpoint the state of an application at regular intervals during its runtime, with a view to allowing it to pick up from where it left off in the event of failure, when in many cases it is easier, cheaper and more reliable to simply run the process again from scratch.

Not always, of course, but surprisingly frequently the economics of building ever more elaborate process monitoring, check-pointing, on-the-fly replication, load-balancing, redundancy and fail-overs into a system simply do not stand up to scrutiny. Each new layer of defensive mechanisms adds both cost and complexity to the system, and complexity is the absolute antithesis of reliability. And that growth in complexity (and therefore cost) is not linear, but rather exponential, as the 'need' to monitor the monitor, back up the backup, and have redundancy for the redundant becomes institutionally imperative.

And in the end, it's never the thing you thought might fail that does. I still have memories of many very long hours freezing my fingers off monitoring a data-scope before discovering that the lift-motor in the unit next door, on the other side of a concrete wall a couple of feet thick, would produce copious amounts of broad-spectrum RF interference whenever they took a delivery of peanuts and cashews. (These are very dense products, which made it easy to overload their lift.) At that point, all network communications between the multiply redundant fail-over servers ceased, their heart-beat checks failed, and they all tried to step in to take over from each other. Result: when the RFI ceased, all the servers were trying to do all the jobs and everything got corrupted.

The best advice I can give is: make each process as simple as possible and have it be driven by the arrival of its data; have it process input data in discrete chunks; never discard your source data.
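
To make that advice concrete, here is a minimal sketch in Python (chosen only for illustration). Everything in it is an assumption rather than anything from the original question: the `inbox`/`outbox`/`archive` directory layout, the function names, and the placeholder word-count transformation are all hypothetical. Each input file is one discrete chunk; the result is written under a temporary name and then renamed, and the source file is moved aside rather than deleted.

```python
import os
from pathlib import Path


def process_chunk(text):
    """Hypothetical placeholder transformation: count the words in a chunk."""
    return f"{len(text.split())}\n"


def run_once(inbox, outbox, archive):
    """Process every chunk waiting in inbox; return the file names handled."""
    for d in (inbox, outbox, archive):
        d.mkdir(parents=True, exist_ok=True)
    handled = []
    for src in sorted(inbox.iterdir()):
        if not src.is_file():
            continue
        result = process_chunk(src.read_text())
        # Write under a temporary name, then rename, so a crash leaves
        # either no output file or a complete one, never a partial one.
        tmp = outbox / (src.name + ".tmp")
        tmp.write_text(result)
        os.replace(tmp, outbox / src.name)
        # Never discard the source data: move it aside instead of deleting.
        os.replace(src, archive / src.name)
        handled.append(src.name)
    return handled
```

If the process dies part-way through, whatever is still sitting in `inbox` is simply processed again on the next run. A chunk that was finished but not yet archived gets redone and its output overwritten with the same result, so no extra recovery logic is needed.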


With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.

The start of some sanity?

Re^2: methods of recovering from ram issues
by locked_user sundialsvc4 (Abbot) on Dec 21, 2011 at 00:33 UTC

    This, with or without the cashews, is a very important point. It is definitely possible to “over think” a recovery scenario to the point that you actually create a system that is quite incapable of practical recovery from anything at all. Usually, the problem is that not only do you not know where you were when the failure occurred (not to mention that “failures do not all happen neatly at the same time...”), but you have no idea where the “last known good point” is, either. Therefore, perhaps the best overall strategy might be one based on what IBM used to call “checkpoint/restart.” The system moves from one known-good state (“checkpoint”) to the next, and it has the well-defined ability, first, to get itself back to that checkpoint, and then to continue (“restart”) from that place (and, perhaps, no other).
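
As a rough illustration of the checkpoint/restart idea (not IBM's facility, just a toy sketch in Python), the loop below records its state after every completed unit of work and, on start-up, resumes from the last recorded state. The JSON state file, the `fail_at` parameter used to simulate a crash, and the function names are all hypothetical.

```python
import json
import os
from pathlib import Path


def load_checkpoint(path):
    """Return the last known-good state, or a fresh one if none exists."""
    if path.exists():
        return json.loads(path.read_text())
    return {"next_index": 0, "total": 0}


def save_checkpoint(path, state):
    """Replace the old checkpoint with the new one in a single rename."""
    tmp = path.with_name(path.name + ".tmp")
    tmp.write_text(json.dumps(state))
    os.replace(tmp, path)


def run(items, checkpoint, fail_at=None):
    """Sum items, checkpointing after each; fail_at simulates a crash."""
    state = load_checkpoint(checkpoint)
    for i in range(state["next_index"], len(items)):
        if fail_at is not None and i == fail_at:
            raise RuntimeError("simulated crash")
        state["total"] += items[i]
        state["next_index"] = i + 1
        save_checkpoint(checkpoint, state)
    return state["total"]
```

Calling `run(items, path, fail_at=3)` and then `run(items, path)` gives the same total as one uninterrupted run, because the second call restarts from the checkpoint and from nowhere else.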

    “Reliance upon RAM,” stated in a slightly different way, might well be considered as “reliance upon guaranteed-writes,” which of course might not be guaranteed at all.   To put it yet another way, it boils down to uncertainty about the system-state as currently fixed to disk vs. the perfectly dynamic state currently not-so fixed in RAM.   Information that is fixed only in RAM obviously cannot be recovered if power is lost, therefore the true concern is the data-integrity of whatever is presently available on disk.
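
For what it is worth, the usual way to narrow that gap between RAM and disk is to ask the operating system to flush explicitly. The following Python sketch is one common pattern, offered as an assumption about what a careful write might look like and not as a guarantee: the helper name `durable_write` is hypothetical, and whether the bytes truly reach the platter still depends on the drive and controller honouring the flush.

```python
import os
from pathlib import Path


def durable_write(path, data):
    """Write data to path, asking the OS to push it to stable storage."""
    tmp = path.with_name(path.name + ".tmp")
    with open(tmp, "wb") as f:
        f.write(data)
        f.flush()             # empty Python's own buffer into the OS
        os.fsync(f.fileno())  # ask the OS to flush the file to the device
    os.replace(tmp, path)     # swap the new file into place in one step
    # Where the platform supports it, also flush the directory entry so
    # the rename itself is recorded on disk.
    if hasattr(os, "O_DIRECTORY"):
        fd = os.open(path.parent, os.O_RDONLY | os.O_DIRECTORY)
        try:
            os.fsync(fd)
        finally:
            os.close(fd)
```

Anything that has not passed through a step like this should be treated as living only in RAM.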

    Excellent post.