Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

I have two scripts. The first one reads from log files
(the logs are created by a server I have no control over; they are written periodically, depending on traffic). It splits the log data according to the time it was created, puts it into four directories (one for each month, four months of data), and then deletes the original log.
The second script reads these files, sorts them, uploads them to a database, and then archives them.
The problem I'm having is this: if the script stops for some reason while splitting the data and has to be started again, I need a way to restart it without duplicating the log data that has already been written.
Can anyone help me out with this, please?
I'm new to Perl, so if someone could show me an example I would appreciate it.

Replies are listed 'Best First'.
Re: file state management and recovery
by jethro (Monsignor) on Feb 05, 2009 at 15:05 UTC

    Maybe tag the split data as preliminary (for example by appending .tmp to the filename) until the script is able to finish? The first thing the script would do is remove all .tmp files; the last thing it would do is rename the .tmp files to their names without .tmp.
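
    A minimal sketch of that pattern, assuming four month-named directories already exist and using a hypothetical month_of() helper in place of whatever parsing the script already does:

        use strict;
        use warnings;

        my @month_dirs = qw(2008-11 2008-12 2009-01 2009-02);   # illustrative names

        # 1. Throw away .tmp leftovers from a run that died part way through.
        unlink glob("$_/*.tmp") for @month_dirs;

        # 2. Split the log, writing everything with a .tmp suffix.
        open my $log, '<', 'server.log' or die "server.log: $!";
        my %out;
        while (my $line = <$log>) {
            my $dir = month_of($line);                   # hypothetical parser, see below
            open $out{$dir}, '>>', "$dir/part.log.tmp" or die "$dir: $!"
                unless $out{$dir};
            print { $out{$dir} } $line;
        }
        close $log;
        close $_ or die "close: $!" for values %out;

        # 3. Only now promote the .tmp files to their real names, then drop the log.
        for my $dir (keys %out) {
            rename "$dir/part.log.tmp", "$dir/part.log" or die "rename in $dir: $!";
        }
        unlink 'server.log' or die "unlink server.log: $!";

        # Hypothetical: the month directory is taken from a leading ISO timestamp.
        sub month_of {
            my ($line) = @_;
            return $line =~ /^(\d{4}-\d{2})/ ? $1 : 'unknown';
        }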

      Seems to me there may be a race condition here:

      • if the '.tmp' files are renamed before the original log file is deleted, then the data will be repeated if the process dies before the log file is deleted.

      • if the original log file is deleted before the '.tmp' files are renamed, then the data will be lost if the process dies before the '.tmp' files are renamed.

      If the name of the '.tmp' files is the same as the log file the data came from, then it's the existence of the log file that matters -- in fact, I don't think you need the '.tmp' suffix. If the process dies at any stage before the log file is deleted, then it can safely be rerun. (I'm assuming each log file has a unique name over time -- for example, by including the date/time of its creation.)
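
      A short sketch of that idea, assuming uniquely named log files and that each line starts with a sortable timestamp (the directory layout and regex are guesses):

          use strict;
          use warnings;

          for my $logfile (glob 'incoming/access-*.log') {   # unique, timestamped names assumed
              (my $base = $logfile) =~ s{.*/}{};
              open my $in, '<', $logfile or die "$logfile: $!";

              my %fh;
              while (my $line = <$in>) {
                  my ($month) = $line =~ /^(\d{4}-\d{2})/ or next;   # assumed timestamp format
                  # '>' rather than '>>': a rerun overwrites any partial output
                  # left behind by a failed run, so nothing is duplicated.
                  open $fh{$month}, '>', "$month/$base" or die "$month/$base: $!"
                      unless $fh{$month};
                  print { $fh{$month} } $line;
              }
              close $in;
              close $_ or die "close: $!" for values %fh;

              # The commit point: deleting the source log marks it as fully processed.
              # Die anywhere before this and the loop can simply be rerun.
              unlink $logfile or die "unlink $logfile: $!";
          }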

      If the requirement is to append stuff from each log file to one or more other files, then I would create an auxiliary file (whose name is related to the current log file being processed) and append to it the name and current length of each file written to (and close the auxiliary file -- expecting that to flush the result to disc). The auxiliary file would be deleted after the related log file. When starting the process, if an auxiliary file is found then:

      • if the related log file exists, then the process needs to be restarted, truncating each file recorded in the auxiliary file to its original size.

      • if the related log file does not exist, then the process died after completing all useful work, so the auxiliary file can be deleted.

      This does depend on the auxiliary file being written away to stable storage when it's closed, or at least before any changes to the related files make it to stable storage. It's also assuming that all the updated files make it to stable storage reliably after being closed, so that data is not lost when the original log file is deleted. If those are concerns, the problem is a lot bigger! (The sketch below shows the basic bookkeeping.)
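
      A sketch of that bookkeeping, with made-up file names, recording each target's length before any appending starts:

          use strict;
          use warnings;

          my $logfile = 'incoming/access-20090205.log';    # illustrative names
          my $auxfile = "$logfile.aux";

          if (-e $auxfile) {
              if (-e $logfile) {
                  # A previous run died mid-append: roll every recorded target back
                  # to its recorded length so reprocessing cannot duplicate data.
                  open my $aux, '<', $auxfile or die "$auxfile: $!";
                  while (my $entry = <$aux>) {
                      chomp $entry;
                      my ($file, $len) = split /\t/, $entry;
                      truncate $file, $len or die "truncate $file: $!" if -e $file;
                  }
                  close $aux;
              }
              else {
                  # The log is already gone, so the previous run finished its work.
                  unlink $auxfile or die "unlink $auxfile: $!";
              }
          }

          if (-e $logfile) {
              # Record the current length of every file we are about to append to,
              # then close the auxiliary file so it reaches disc before we write.
              my @targets = map { "2009-0$_/archive.log" } 1 .. 4;   # hypothetical
              open my $aux, '>', $auxfile or die "$auxfile: $!";
              print {$aux} join("\t", $_, (-s $_ // 0)), "\n" for @targets;
              close $aux or die "close $auxfile: $!";

              # ... append the split data to @targets here ...

              unlink $logfile or die "unlink $logfile: $!";  # delete the log first,
              unlink $auxfile or die "unlink $auxfile: $!";  # then its auxiliary file
          }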

Re: file state management and recovery
by Anonymous Monk on Feb 05, 2009 at 15:49 UTC
    Since you're sorting the file contents, you must have enough memory to hold and sort the entire set of log files. If you can avoid sorting the data and holding the complete set of files in memory, that will help with any future memory issues. Perhaps read in enough to detect a change of timestamp and sort just that slice of data. I'm assuming the log files are written with timestamps and more or less in sequence.
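
    A rough sketch of that slice-by-slice approach, assuming each line begins with a sortable timestamp (the file names and field layout are guesses):

        use strict;
        use warnings;

        open my $in,  '<', 'split/2009-02/part.log'    or die $!;   # hypothetical paths
        open my $out, '>', 'split/2009-02/part.sorted' or die $!;

        # Hold only one "slice" in memory: all consecutive lines sharing a timestamp.
        my ($current_ts, @slice);
        while (my $line = <$in>) {
            my $ts = $line =~ /^(\S+)/ ? $1 : '';    # assume timestamp is the first field
            if (defined $current_ts && $ts ne $current_ts) {
                print {$out} sort @slice;            # sort and flush the finished slice
                @slice = ();
            }
            $current_ts = $ts;
            push @slice, $line;
        }
        print {$out} sort @slice if @slice;          # don't forget the final slice
        close $out or die $!;
        close $in;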

    When the data is written to the DB there is presumably a timestamp field for it. That could be used to find the latest records already in the DB. On restart of the script, the records with the latest timestamp could be deleted and replaced from the logs to ensure a complete set of records for that timestamp; subsequent timestamp data would simply come from subsequent log data. Either that, or you have to record how many records through the files you have processed. Perhaps the database updates could be used for that as well.
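
    A hedged sketch of that restart logic on the database side, using DBI; the DSN, table, and column names are made up:

        use strict;
        use warnings;
        use DBI;

        my $dbh = DBI->connect('dbi:mysql:logs', 'user', 'pass',
                               { RaiseError => 1, AutoCommit => 1 });

        # On restart: find the newest timestamp already loaded, throw those rows
        # away, and reload from that timestamp onward so the set is complete.
        my ($latest) = $dbh->selectrow_array(
            'SELECT MAX(logged_at) FROM access_log');      # hypothetical table/column

        if (defined $latest) {
            $dbh->do('DELETE FROM access_log WHERE logged_at = ?', undef, $latest);
        }

        my $insert = $dbh->prepare(
            'INSERT INTO access_log (logged_at, line) VALUES (?, ?)');

        open my $in, '<', 'split/2009-02/part.sorted' or die $!;   # hypothetical file
        while (my $line = <$in>) {
            chomp $line;
            my ($ts, $rest) = split / /, $line, 2;   # assume timestamp is the first field
            next if defined $latest && $ts lt $latest;   # string compare assumes sortable timestamps
            $insert->execute($ts, $rest);
        }
        close $in;
        $dbh->disconnect;
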
Re: file state management and recovery
by planetscape (Chancellor) on Feb 05, 2009 at 19:22 UTC
      I guess so; I did not get what I was expecting from that. I'm really having a hard time coming up with a method to do this.

        Perhaps you should have told us what you expected? A whole lot more back story would help too. There is only so much we can glean by looking into crystal balls or the internals of chickens.


        Perl's payment curve coincides with its learning curve.
Re: file state management and recovery
by Zenshai (Sexton) on Feb 05, 2009 at 18:47 UTC

    Do you do the actual splitting manually (i.e., read line by line and dump into a new file), or do you use a utility?

    If it's the former, I suggest you take a look at some of the options available for splitting files. If you're on Windows, take a look at this package, specifically the 'split.exe' utility.

    Otherwise, I suggest you investigate further into exactly why your script is dying, as that's kind of important in fixing the problem.