It would be very helpful if you could explain more about what you are doing (in terms of the application), how much processing each input line takes to reach the "done" status, and how many lines you are processing.
From your description, when you finish processing all lines in the input file, there will be an equal number of lines in the output file, i.e. one input line results in one output line. True?
I am going to answer your question, but I have to say that many times the right answer is to make the processing so darn fast that the ability to restart from some middle point doesn't matter at all (i.e. just starting over is ok).
From the description, I see no reason to keep any significant amount of data in memory, much less a "lot".
Read the input one line at a time: read a line, process it, write it to the done file, read the next line. The only data in memory is the current line.
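Here is a minimal sketch of that loop, just to show how little has to live in memory. The names process_line and stream_file are made up for illustration, and the uc call is only a stand-in for whatever your real per-line work is.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the real per-line processing.
sub process_line {
    my ($line) = @_;
    return uc $line;
}

# Stream $in_path to $out_path one line at a time.
# Returns the number of lines written.
sub stream_file {
    my ($in_path, $out_path) = @_;
    open my $in,  '<', $in_path  or die "Cannot read $in_path: $!";
    open my $out, '>', $out_path or die "Cannot write $out_path: $!";

    my $count = 0;
    while (my $line = <$in>) {      # only the current line is held in memory
        chomp $line;
        print {$out} process_line($line), "\n";
        $count++;
    }

    close $in;
    close $out or die "Cannot close $out_path: $!";
    return $count;
}
```

Called as stream_file('input.txt', 'done.txt'), memory use stays flat no matter how big the input file is, as long as process_line itself doesn't hoard anything.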
If the program crashes, recover as follows: open the "done" file read-only and start a new "done" file for writing. Copy lines from the previous "done" file to the new one, but always stay one line behind. The last "done" line may be corrupt, and rather than try to figure out whether it is good or not, just throw it away. Close the previous "done" file.
Now we know how many "good lines" have been processed. So open the input file and just skip that number of lines. BTW, Perl keeps track of the current line number in the $. variable, but a simple line counter is a very "cheap" thing to implement.
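A rough sketch of those two recovery steps follows. The sub names salvage_done and open_input_after are invented for this example, and it assumes one output line per input line. It deliberately ignores the nastier cases (such as a crash during the recovery itself).

```perl
use strict;
use warnings;

# Copy $old_done to $new_done while staying one line behind, so the
# final (possibly truncated) line is never copied.
# Returns the number of lines kept.
sub salvage_done {
    my ($old_done, $new_done) = @_;
    open my $old, '<', $old_done or die "Cannot read $old_done: $!";
    open my $new, '>', $new_done or die "Cannot write $new_done: $!";

    my $kept = 0;
    my $prev;                        # the line we are holding back
    while (my $line = <$old>) {
        if (defined $prev) {
            print {$new} $prev;
            $kept++;
        }
        $prev = $line;
    }
    # $prev now holds the last line of the old file; it is thrown away.

    close $old;
    close $new or die "Cannot close $new_done: $!";
    return $kept;
}

# Open the input file and skip the first $n lines with a plain counter.
# Returns the filehandle, positioned at the first unprocessed line.
sub open_input_after {
    my ($in_path, $n) = @_;
    open my $in, '<', $in_path or die "Cannot read $in_path: $!";
    while ($n > 0 and defined(my $skipped = <$in>)) {
        $n--;
    }
    return $in;
}
```

After my $kept = salvage_done($old, $new), reopen the new done file in append mode ('>>') and carry on reading from open_input_after($input, $kept). Note that the last line of the old done file is dropped even when it was actually fine, so at worst that one line gets processed twice.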
What I have described is a very simple strategy, but the devil is in the details. What happens if there is a crash while the recovery from a previous crash is going on? It turns out that is not just hypothetical; lots of crashes are caused by systemic errors that your code has no control over. And there are all sorts of issues concerning cleaning up intermediate files.
Anyway, one of my points is that adding the error recovery will either impact the performance or the complexity of the application and probably both.
I would strongly recommend recoding the application so that it uses very little memory and runs as fast as possible. The error recovery stuff will add 10x to the complexity of the code.
Post some "lean and mean" Perl code with some detail of the app. The first focus should be to make it so fast that it doesn't have to be restarted in the middle. If that cannot be achieved, then let's talk about the restart code and how to implement it efficiently and reliably.