Maybe tag the split data as preliminary (for example by appending .tmp to the filename) until the script is able to finish? The first thing the script would do is remove all .tmp files; the last thing it would do is rename the .tmp files to the names without .tmp.
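Something along those lines, as a rough Perl sketch (the chunk names and data are made up, and how the log actually gets sliced is up to you):

    use strict;
    use warnings;

    # A previous run that died leaves stale *.tmp files behind; discard them first.
    unlink glob('chunk_*.tmp');

    # Made-up data standing in for one slice of the log.
    my @lines_for_this_chunk = ("line 1\n", "line 2\n");

    # Write each split piece under a .tmp name first ...
    open my $out, '>', 'chunk_0001.tmp' or die "open: $!";
    print {$out} @lines_for_this_chunk;
    close $out or die "close: $!";

    # ... and only claim the final name once the piece is complete.
    rename 'chunk_0001.tmp', 'chunk_0001' or die "rename: $!";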
| [reply] |
Seems to me there may be a race condition here:
- if the '.tmp' files are renamed before the original log file is deleted, then the data will be repeated if the process dies before the log file is deleted.
- if the original log file is deleted before the '.tmp' files are renamed, then the data will be lost if the process dies before the '.tmp' files are renamed.
If the '.tmp' files are named after the log file the data came from, then it's the existence of the log file that matters -- in fact, I don't think you need the '.tmp' suffix at all. If the process dies at any stage before the log file is deleted, it can safely be rerun. (I'm assuming each log file has a unique name over time -- for example, by including the date/time of its creation.)
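A rough sketch of that ordering, with made-up file names -- the point is only that deleting the log is the last step, so a rerun after a crash simply repeats the whole file:

    use strict;
    use warnings;

    # Assumes each log has a unique, timestamped name, e.g. app-20240101-1200.log
    for my $log (glob('app-*.log')) {
        my $piece = "$log.out";              # output named after its source log

        open my $in,  '<', $log   or die "read $log: $!";
        open my $out, '>', $piece or die "write $piece: $!";
        print {$out} $_ while <$in>;         # stand-in for the real per-line work
        close $out or die "close $piece: $!";
        close $in;

        # Deleting the source log is the last, "commit" step: if the script dies
        # anywhere above this line, the log still exists and can be reprocessed.
        unlink $log or die "unlink $log: $!";
    }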
If the requirement is to append stuff from each log file to one or more other files, then I would create an auxiliary file (whose name is related to the current log file being processed) and append to it the name and current length of each file written to (and close the auxiliary file -- expecting that to flush the result to disc). The auxiliary file would be deleted after the related log file. When starting the process, if an auxiliary file is found then:
- if the related log file exists, then the process needs to be restarted, truncating each file recorded in the auxiliary file to its original size.
- if the related log file does not exist, then the process died after completing all useful work, so the auxiliary file can be deleted.
This does depend on the auxiliary file being written away to stable storage when it's closed, or at least before any changes to the related files make it to stable storage. It also assumes that all the updated files make it to stable storage reliably after being closed, so that data is not lost when the original log file is deleted. If those are concerns, the problem is a lot bigger!
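A sketch of the restart check, assuming the auxiliary file records one tab-separated "filename length" pair per line (all names here are invented):

    use strict;
    use warnings;

    my $log = 'app-20240101-1200.log';   # the log being processed (invented name)
    my $aux = "$log.aux";                # journal written while appending to other files

    if (-e $aux) {
        if (-e $log) {
            # Died mid-run: roll every target file back to its recorded length,
            # then the whole log can be processed again from the start.
            open my $fh, '<', $aux or die "read $aux: $!";
            while (my $line = <$fh>) {
                chomp $line;
                my ($file, $len) = split /\t/, $line;
                truncate $file, $len or die "truncate $file: $!";
            }
            close $fh;
        }
        else {
            # Log already deleted: all useful work finished, only the cleanup
            # of the auxiliary file was missed.
            unlink $aux or die "unlink $aux: $!";
        }
    }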
| [reply] |
Since you're sorting the file contents, you must have enough memory to hold and sort the entire set of log files. If you can avoid sorting the data and holding the complete set of files in memory, that would help with any future memory issues. Perhaps read in enough to detect a change of timestamp and sort just that slice of data. I'm assuming the log files are written with timestamps and more or less in sequence.
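Something like this, assuming each line starts with its timestamp (a guess about your log format) -- only one timestamp's worth of lines is ever held and sorted at a time:

    use strict;
    use warnings;

    open my $in, '<', 'app.log' or die "open: $!";   # invented log name

    my ($current_ts, @slice);
    while (my $line = <$in>) {
        my ($ts) = $line =~ /^(\S+)/;    # timestamp assumed to lead each line
        next unless defined $ts;
        if (defined $current_ts && $ts ne $current_ts) {
            print sort @slice;           # sort and emit only the finished slice
            @slice = ();
        }
        $current_ts = $ts;
        push @slice, $line;
    }
    print sort @slice if @slice;         # don't forget the final slice
    close $in;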
When the data is written to the DB there is presumably a timestamp field for each record. That could be used to find the latest records already in the DB. On restart of the script, the records with the latest timestamp could be deleted and replaced from the logs, to ensure a complete set of records for that timestamp. Subsequent timestamp data would simply come from subsequent log data.
Either that or you've got to record how far through the files you have processed. Perhaps the database updates could be used for that as well.
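A hedged sketch of the first option with DBI, using an invented table and column (log_entries, logged_at); the idea is just to find the newest timestamp already loaded, drop those rows, and let the normal load re-insert everything from that point in the logs:

    use strict;
    use warnings;
    use DBI;

    # Connection details and schema are placeholders.
    my $dbh = DBI->connect('dbi:Pg:dbname=logs', 'user', 'password',
                           { RaiseError => 1, AutoCommit => 1 });

    # Newest timestamp that made it into the DB before the crash.
    my ($latest) = $dbh->selectrow_array('SELECT MAX(logged_at) FROM log_entries');

    if (defined $latest) {
        # Rows for that timestamp may be incomplete, so discard them; the normal
        # load then re-inserts everything from the logs with logged_at >= $latest.
        $dbh->do('DELETE FROM log_entries WHERE logged_at = ?', undef, $latest);
    }

    $dbh->disconnect;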
| [reply] |
I guess so, I did not get what I was expecting from that. I'm really having a hard time coming up with a method to do this.
| [reply] |
Do you do the actual splitting manually (i.e. read line by line and dump into a new file), or do you use a utility?
If it's the former, I suggest you take a look at some of the options available for splitting files. If you're on Windows, take a look at this package, specifically the 'split.exe' utility (a bare-bones manual version is sketched below for comparison).
Otherwise, I suggest you investigate further into exactly why your script is dying, as that's kind of important in fixing the problem.
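For reference, a bare-bones manual split in Perl looks something like this (the file names and chunk size are arbitrary); a dedicated split utility does the same job with better error handling:

    use strict;
    use warnings;

    my $lines_per_chunk = 10_000;        # arbitrary chunk size
    my ($count, $chunk, $out) = (0, 0, undef);

    open my $in, '<', 'big.log' or die "open big.log: $!";   # invented input name
    while (my $line = <$in>) {
        if ($count % $lines_per_chunk == 0) {
            close $out if $out;
            $chunk++;
            open $out, '>', sprintf('big.log.%04d', $chunk)
                or die "open chunk $chunk: $!";
        }
        print {$out} $line;
        $count++;
    }
    close $out if $out;
    close $in;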
| [reply] |