Re (tilly) 1: Best method for failure recovery?

Well I would suggest working it like this.

Divide the large job into a series of more manageable tasks which have dependencies between them. Set up each task so that it can be restarted at any point without having damaged its ability to go forward. (Basically this means writing each task so that it doesn't wipe out its initial data, and can clean up or overwrite a previous partial run.) Then set up a control table listing the open tasks. In that control table you mark the tasks that need to run, mark each one as running when you start it, run it, then mark it as done.

Now your script can be re-run as many times as you want, and will skip work that was already done. In fact you can even have your script do as much work as feasible on each run, skipping any trouble spots, so that by the time a human looks at it the bulk of the work has been done despite any issues. Plus, as a bonus, if you do this carefully you may get the ability to run your script simultaneously on several machines.
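Here is a rough sketch of the "skip the trouble spots" and "several machines" ideas, again in Python with SQLite purely for illustration. Every name in it (`control`, `claim`, `run_what_we_can`, the `failed` status) is hypothetical. The part that matters for running on several machines is that a task is claimed with a single conditional UPDATE, so only one worker can win it. The sketch assumes all workers share one database, and it does not try to recover a task whose worker died while the row said `running`; a real system would need a timeout or a manual reset for that.

```python
import sqlite3

def make_table(conn, names):
    """Create the (hypothetical) control table with every task pending."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS control ("
        " name TEXT PRIMARY KEY,"
        " status TEXT NOT NULL DEFAULT 'pending')"
    )
    conn.executemany(
        "INSERT OR IGNORE INTO control (name) VALUES (?)",
        [(n,) for n in names],
    )
    conn.commit()

def status_of(conn, name):
    return conn.execute(
        "SELECT status FROM control WHERE name = ?", (name,)
    ).fetchone()[0]

def set_status(conn, name, status):
    conn.execute("UPDATE control SET status = ? WHERE name = ?", (status, name))
    conn.commit()

def claim(conn, name):
    """Try to take a pending or previously failed task.

    The WHERE clause makes the check and the update one statement, so at
    most one worker sees rowcount == 1 for a given task.
    """
    cur = conn.execute(
        "UPDATE control SET status = 'running'"
        " WHERE name = ? AND status IN ('pending', 'failed')",
        (name,),
    )
    conn.commit()
    return cur.rowcount == 1

def run_what_we_can(conn, tasks, notify=print):
    """tasks: list of (name, depends_on, func), listed in dependency order."""
    for name, depends_on, func in tasks:
        if any(status_of(conn, dep) != "done" for dep in depends_on):
            continue  # blocked by a trouble spot upstream, leave it for later
        if not claim(conn, name):
            continue  # already done, or another worker has it
        try:
            func()
        except Exception as exc:
            set_status(conn, name, "failed")
            notify(f"task {name} failed: {exc}")  # tell a human, then carry on
        else:
            set_status(conn, name, "done")
```

After a run, the rows marked `failed` are the trouble spots and the rows still `pending` are whatever depended on them. Fix the problem, run the script again, and only those get picked up.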

I can attest from personal experience that while writing everything in this fashion can be a lot of work, even a few small steps toward the control table and distinct transactions idea do a lot to simplify your overall program and make it capable of handling all sorts of complex failure modes robustly. (Something doesn't look right? Abort, send notification, then continue with other stuff it can do!)

I can also attest from personal experience that the various goto solutions offered remind me of some really bad systems that I have worked with. Sure, if you do everything just right, it might work. But it is inherently a fragile approach and leads to fragile code. Not what I want in a production system! (And no, I have not merely heard vague rumor that goto is a bad idea. Give me credit for having done more homework on the topic than that.)
