mr_flea has asked for the wisdom of the Perl Monks concerning the following question:
Hello, I recently embarked on a journey to parse log files of arbitrary format. I've mostly got it down now, except for the timestamp format. The only information accessible to the script regarding the timestamp format is its strftime format string.
The biggest issue hinges upon the fact that I don't know what the exact format is going to be, so the script is going to have to deal with the formatting string itself. Using this information, my original solution was to use the wonderful DateTime::Format::Strptime module. However, disaster soon struck. I discovered that with some lines in the log file, I could not reliably separate the timestamp and the text of the log entry, and thus could not figure out what to pass to Strptime and what to treat as log entry text.
My initial idea for a solution was to generate a regular expression from the date formatting string so that I could separate the date to easily glean the text of each log entry. Here is some sample code for the kind of setup I was planning:
    #!/usr/bin/perl
    use strict;
    use warnings;

    my $timeformat = "*%H:%M:%S%% >"; # Example.
    my %replacements = (
        '%' => '\%',
        'a' => '[[:alpha:]]+',
        'H' => '\d{2}',
        'M' => '\d{2}',
        'S' => '\d{2}'
    );

    $timeformat = quotemeta($timeformat);
    $timeformat =~ s/\\\%\\?(.)/$replacements{$1}/eg;
    print ("The regular expression is: $timeformat\n");
However, while writing this, I realized that I would also have to deal with locales! (Apparently, some of the formatting tokens are locale-specific.) Additionally, since I'm already writing a regex to extract the various values in the datestamp, I might as well parse it into a DateTime object myself (speed is a consideration, and Strptime alone is already a little slow), but this too introduces locale issues (weekday names, month names, AM/PM, etc.).
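One way to approach the locale problem is to let strftime itself list the names it would produce, rather than hard-coding them. The following is a minimal sketch under that assumption; `locale_alternation` and `%locale_re` are hypothetical names, and the C locale is selected only so that the example is reproducible.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use POSIX qw(strftime setlocale LC_TIME);

# Assumption: the log was written under the C locale. Substitute the real
# locale name here if it is known and installed on the system.
setlocale(LC_TIME, 'C');

# Ask strftime for every value a token can take and join the distinct,
# non-empty results into a regex alternation (longest names first, so a
# short name never shadows a longer one that starts the same way).
sub locale_alternation {
    my ($token, @times) = @_;
    my %seen;
    my @names = grep { length($_) && !$seen{$_}++ }
                map  { strftime($token, @$_) } @times;
    return join '|',
           map  { quotemeta($_) }
           sort { length($b) <=> length($a) or $a cmp $b } @names;
}

# strftime arguments: sec, min, hour, mday, mon (0-11), year - 1900.
my @months   = map { [0, 0, 0,  1,  $_, 100] } 0 .. 11;  # every month of 2000
my @weekdays = map { [0, 0, 0,  $_, 0,  100] } 2 .. 8;   # seven consecutive days
my @halves   = map { [0, 0, $_, 1,  0,  100] } 0, 12;    # midnight and noon

# Regex fragments for the locale-dependent tokens.
my %locale_re = (
    'b' => locale_alternation('%b', @months),
    'B' => locale_alternation('%B', @months),
    'a' => locale_alternation('%a', @weekdays),
    'A' => locale_alternation('%A', @weekdays),
    'p' => locale_alternation('%p', @halves),
);

print "%$_ => (?:$locale_re{$_})\n" for sort keys %locale_re;
```

Each fragment can then be wrapped in `(?:...)` and used in a token table in place of a guess such as `[[:alpha:]]+`. Mapping a matched name back to a month number would need a reverse lookup built the same way.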
Surely there is a better way to do this? Wise monks, please release me from my insanity.
Thank you for taking the time to read.
(update)
Sorry, it looks like I've left pretty much everyone confused. In summary, here's the issue: due to the timestamps having many possible formats, I can't figure out how to reliably separate them from the rest of the line. The goal is to extract the data from the line without capturing part of the timestamp and then to parse the timestamp into a DateTime object.
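The second half of that goal, turning an already isolated timestamp into a date object, can be sketched with the core Time::Piece module. This is a stand-in chosen so that the example runs without extra installs; it is not DateTime, and the values of `$format` and `$timestamp` are assumptions made for illustration. `%F` is spelled out as `%Y-%m-%d` to stay within tokens that are certain to be supported.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Time::Piece;

# Assumption: the timestamp has already been separated from the log text.
my $format    = '%Y-%m-%d %H:%M';
my $timestamp = '2008-12-12 00:39';

# strptime is wrapped in eval so that a mismatch can be reported with the
# offending input instead of aborting the whole run.
my $t = eval { Time::Piece->strptime($timestamp, $format) };
die "Could not parse '$timestamp' with '$format': $@" unless defined $t;

printf "%04d-%02d-%02d %02d:%02d\n",
    $t->year, $t->mon, $t->mday, $t->hour, $t->min;
```

The same two inputs (string and format) are what DateTime::Format::Strptime expects, so the splitting step is the only part that has to be solved separately.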
And now for some sample data. (Although, I'm not sure how much help it will be...)
Here are some sample lines of input:
    09:12: 5:14:29-!- {more garbage goes here}
    09:12: 5:14:37
    09:12: 5:14:37
In this data sample, the first timestamp is "09:12: 5:14:29", corresponding to the format "%y:%m:%e:%H:%M". The other two lines contain only a timestamp and no data.
Here are some more (with a different timestamp format):
    2008-12-12 00:39  * {more stuff here}
    2008-12-12 01:17 < {data here}
    2008-12-12 01:30
    2008-12-12 01:31
The format in this sample is "%F %H:%M " (with an extra space at the end), and the data for the first line is " * {more stuff here}" (with a space at the beginning). On the second line, the data is "< {data here}". The last two lines have no data, only timestamp.
Since I only care about separating the timestamp from the rest of the line, I don't have to actually parse the varying data formats. I just need to somehow parse and remove the timestamp portion of each line, given the strftime format. In the examples, the strftime formats (which are accessible to the script) can be converted to the two regular expressions below, respectively (the second ends with an escaped space):
    (\d{2})\:(\d{2})\:([\d\s]\d)\:(\d{2})\:(\d{2})
    (\d{4})\-(\d{2})\-(\d{2})\ (\d{2})\:(\d{2})\ 
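The conversion from format string to regular expression, and the split it enables, can be sketched as follows. This is only an illustration under stated assumptions: the token table covers just the locale-independent tokens that appear in the samples, and `format_to_source`, `split_line`, `%token_re`, and `%alias` are hypothetical names. The format is walked token by token rather than run through quotemeta first, so that `%%` and composite tokens such as `%F` are handled in one place.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Regex fragment for each simple token (assumed subset, not complete).
my %token_re = (
    '%' => '\%',
    'Y' => '\d{4}',
    'y' => '\d{2}',
    'm' => '\d{2}',
    'd' => '\d{2}',
    'e' => '[\d ]\d',     # day of month, space-padded
    'H' => '\d{2}',
    'M' => '\d{2}',
    'S' => '\d{2}',
);

# Composite tokens are expanded into simpler formats.
my %alias = ( 'F' => '%Y-%m-%d', 'T' => '%H:%M:%S' );

# Translate a strftime format into regex source text.
sub format_to_source {
    my ($format) = @_;
    my $src = '';
    pos($format) = 0;
    # A token is "%" plus one character; anything else is literal text.
    while ($format =~ /\G(?:%(.)|([^%]+))/gcs) {
        my ($token, $literal) = ($1, $2);
        if    (!defined $token)          { $src .= quotemeta $literal }
        elsif (exists $alias{$token})    { $src .= format_to_source($alias{$token}) }
        elsif (exists $token_re{$token}) { $src .= $token_re{$token} }
        else  { die "Unsupported strftime token %$token\n" }
    }
    die "Dangling % at end of format\n" if pos($format) != length $format;
    return $src;
}

my %compiled;   # cache: one compiled regex per format string

# Return (timestamp, rest of line), or an empty list if the line does not
# start with a timestamp in the given format.
sub split_line {
    my ($format, $line) = @_;
    my $re = $compiled{$format} //= do {
        my $src = format_to_source($format);
        qr/\A($src)(.*)\z/s;
    };
    return $line =~ $re ? ($1, $2) : ();
}

my @cases = (
    [ '%y:%m:%e:%H:%M', '09:12: 5:14:29-!- {more garbage goes here}' ],
    [ '%y:%m:%e:%H:%M', '09:12: 5:14:37' ],
    [ '%F %H:%M ',      '2008-12-12 00:39  * {more stuff here}' ],
    [ '%F %H:%M ',      '2008-12-12 01:17 < {data here}' ],
);

for my $case (@cases) {
    my ($format, $line) = @$case;
    my @parts = split_line($format, $line);
    if (@parts) { print "timestamp=[$parts[0]] rest=[$parts[1]]\n" }
    else        { print "no timestamp: [$line]\n" }
}
```

Because the trailing space in "%F %H:%M " becomes a required character, a line whose trailing whitespace has been stripped would not match; whether to relax that is a design choice this sketch leaves open.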
Replies are listed 'Best First'.
Re: Parsing arbitrarily-formatted timestamps out of log file entries
by GrandFather (Saint) on Dec 11, 2009 at 23:19 UTC
Re: Parsing arbitrarily-formatted timestamps out of log file entries
by Ieronim (Friar) on Dec 12, 2009 at 20:03 UTC
Re: Parsing arbitrarily-formatted timestamps out of log file entries
by mr_flea (Novice) on Dec 12, 2009 at 23:01 UTC
Re: Parsing arbitrarily-formatted timestamps out of log file entries
by Anonymous Monk on Dec 12, 2009 at 02:53 UTC
Re: Parsing arbitrarily-formatted timestamps out of log file entries
by chuckbutler (Monsignor) on Dec 11, 2009 at 23:44 UTC