Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Hi, I wonder if anybody has attempted to do the following before and, if so, whether they could share their experiences and recommendations with me.
I'm only slightly more advanced than a novice at Perl.
I need to monitor some log files written to a predefined path (unix). I've written a parser to get information from these log files. The problem I have is how to keep track of the files that I've already parsed.
I have read-only access to the files. The files may be overwritten (in which case the inode and modification time will change) but the name will remain the same. At the time of running the parser (say hourly) I only want to read information from the files I haven't already processed.
The parser could be run from a scheduler or be a daemon process that wakes up at specified intervals.
Hence the need to keep track of files already processed (bearing in mind that a previously processed file may have been replaced by a file of the same name but with different contents since the last parser run).
What's the best way to handle this? Any suggestions? I've looked at File::Compare as a possible starting point.
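
To illustrate, here's a rough sketch of the per-file "signature" I have in mind, using stat to capture device, inode, mtime and size (the glob path below is just a placeholder for my real log directory):

    #!/usr/bin/perl
    use strict;
    use warnings;

    # An overwritten file keeps its name but gets a new inode/mtime,
    # so this signature changes and the file looks "new" again.
    for my $path (glob '/var/log/myapp/*.log') {
        my ($dev, $ino, $mtime, $size) = (stat $path)[0, 1, 9, 7];
        next unless defined $ino;    # file vanished between glob and stat
        print "$path => $dev:$ino:$mtime:$size\n";
    }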

Replies are listed 'Best First'.
Re: keeping track of log files
by blazar (Canon) on May 19, 2006 at 13:51 UTC

    What have you tried thus far? The kind of task you are thinking of (keeping track of files already processed, along with times) tends to make me think of a hash. Since this info must be made persistent, just try one of the various serialization modules available from CPAN, which may range e.g. from Storable to YAML::Syck, depending on your actual needs.
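
    For instance, a minimal sketch with Storable (the state-file and log paths are placeholders; adapt them to your setup):

        use strict;
        use warnings;
        use Storable qw(retrieve store);

        my $state_file = '/var/tmp/parser_state.db';    # hypothetical location
        my $seen = -e $state_file ? retrieve($state_file) : {};

        for my $path (glob '/var/log/myapp/*.log') {    # placeholder path
            my ($ino, $mtime) = (stat $path)[1, 9];
            my $sig = "$ino:$mtime";
            next if defined $seen->{$path} && $seen->{$path} eq $sig;
            # ... parse $path here ...
            $seen->{$path} = $sig;    # remember this version of the file
        }

        store($seen, $state_file);    # persist the hash for the next run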

      Interesting, I'll have to look at Storable for some of my own stuff.
Re: keeping track of log files
by dsheroh (Monsignor) on May 19, 2006 at 15:03 UTC
    It looks to me like you may not really need to keep track of what you've already done at all. Simply keeping track of when you last ran, and then examining all files which have been modified since then, should accomplish the same thing while also being substantially simpler (see the sketch after the caveats below).

    Caveats:

    1) This assumes that the time spent in each processing run is minimal. If a run takes, say, 5 minutes, it can miss (or double-process) logs that change during that window, depending on when you record the timestamp.

    2) If your situation is such that you need to provide 'proof' that certain files have been processed at certain times then, of course, you'll need the records to satisfy those requirements.
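
    A minimal sketch of that idea, assuming a writable timestamp file (all paths here are placeholders):

        use strict;
        use warnings;

        my $stamp_file = '/var/tmp/parser_lastrun';    # hypothetical stamp file
        my $last_run   = -e $stamp_file ? (stat $stamp_file)[9] : 0;

        for my $path (glob '/var/log/myapp/*.log') {
            my $mtime = (stat $path)[9];
            # Overwriting a file bumps its mtime, so replaced logs show up too.
            next unless defined $mtime && $mtime > $last_run;
            # ... parse $path here ...
        }

        # Touch the stamp file only after a successful run.
        open my $fh, '>', $stamp_file or die "Cannot write $stamp_file: $!";
        close $fh;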

Re: keeping track of log files
by nimdokk (Vicar) on May 19, 2006 at 13:44 UTC
    I've done something similar. What I do may not be ideal, but it works. I have a job that runs once a day on several servers and moves logs from Directory A on each server to a central log repository on another server. When the job starts, it reads in a file that contains a list of the logs that have already been moved. Then it gets a list of files in the directory and moves anything that is not on the list. Not perfect, since it relies on an external file, but it works pretty nicely. What you might want is a list of files that have already been processed, stored in some sort of Perl data structure (see the sketch below).
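
    Something along these lines (all paths and file names here are assumptions; in your read-only case you would parse rather than move the files):

        use strict;
        use warnings;
        use File::Copy qw(move);

        my $src_dir   = '/path/to/DirectoryA';    # placeholder
        my $dest_dir  = '/path/to/repository';    # placeholder
        my $done_list = '/path/to/moved.list';    # placeholder

        # Load the names we've already handled, one per line.
        my %moved;
        if (open my $in, '<', $done_list) {
            chomp(my @names = <$in>);
            @moved{@names} = ();
            close $in;
        }

        open my $out, '>>', $done_list or die "Cannot append to $done_list: $!";
        for my $path (glob "$src_dir/*") {
            (my $name = $path) =~ s{.*/}{};    # strip the directory part
            next if exists $moved{$name};
            unless (move($path, "$dest_dir/$name")) {
                warn "Could not move $path: $!";
                next;
            }
            print {$out} "$name\n";    # record it as moved
        }
        close $out;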