Re^2: methods of recovering from ram issues

This, with or without the cashews, is a very important point. It is entirely possible to “overthink” a recovery scenario to the point that you create a system that is incapable of practical recovery from anything at all. Usually the problem is not only that you do not know where you were when the failure occurred (not to mention that “failures do not all happen neatly at the same time...”), but also that you have no idea where the “last known good point” is. Therefore, perhaps the best overall strategy is one based on what IBM used to call “checkpoint/restart.” The system moves from one known-good state (“checkpoint”) to the next, and it has the well-defined ability, first, to get itself back to the most recent checkpoint, and then to continue (“restart”) from that place (and, perhaps, no other).
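
To make the checkpoint/restart idea concrete, here is a minimal sketch in Python. It is only an illustration under stated assumptions: the work is a list of items processed in order, the checkpoint is a small JSON file, and the names `load_checkpoint`, `save_checkpoint`, `run` and the `next_index` field are all hypothetical, not taken from any IBM facility or from the parent post.

```python
import json
import os


def load_checkpoint(path):
    """Return the index of the next item to process, or 0 if no checkpoint exists."""
    try:
        with open(path) as f:
            return json.load(f)["next_index"]
    except FileNotFoundError:
        return 0


def save_checkpoint(path, next_index):
    """Record the new known-good point by replacing the checkpoint file in one step."""
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump({"next_index": next_index}, f)
    # os.replace swaps the file in atomically, so a reader sees either the
    # old checkpoint or the new one, never a half-written file.
    os.replace(tmp, path)


def run(items, path, process):
    """Process items in order, resuming from the last checkpoint and no other place."""
    start = load_checkpoint(path)
    for i in range(start, len(items)):
        process(items[i])
        save_checkpoint(path, i + 1)
```

Note that if the failure lands between `process` and `save_checkpoint`, the restart repeats that one item, so in this sketch `process` is assumed to be safe to run twice. The sketch also says nothing about whether the checkpoint file has physically reached the disk; that is a separate question.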

“Reliance upon RAM,” stated slightly differently, might well be considered “reliance upon guaranteed writes,” which of course might not be guaranteed at all. To put it yet another way, it boils down to uncertainty about the system state as currently committed to disk versus the volatile state that exists only in RAM. Information held only in RAM cannot be recovered if power is lost; therefore the true concern is the integrity of whatever data is actually on disk at that moment.
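
As a rough illustration of the gap between “written” and “on disk,” here is one common way to ask the operating system to push data out of RAM before treating it as committed. This is a sketch, assuming a POSIX-like system (opening a directory with `os.open` does not work on Windows), and the function name `durable_write` is hypothetical. Even then, the guarantee is only as good as the drive and its write cache honoring the flush.

```python
import os


def durable_write(path, data: bytes):
    """Write data to path so that, as far as the OS can promise, it survives power loss."""
    tmp = path + ".tmp"
    with open(tmp, "wb") as f:
        f.write(data)
        f.flush()             # move Python's buffer into the OS page cache
        os.fsync(f.fileno())  # ask the OS to push the page cache to the device
    os.replace(tmp, path)     # atomically swap the new file into place

    # The rename itself lives in the directory entry, so sync the directory too.
    dir_fd = os.open(os.path.dirname(os.path.abspath(path)), os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)
```

Without the `fsync` calls, a successful `write` only means the data reached RAM that belongs to the kernel, which is exactly the state that vanishes when the power does.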

Excellent post.