in reply to Parsing arbitrarily-formatted timestamps out of log file entries
I ended up removing the dates by generating a regex from the strftime format to match them, using this:
use Carp;    # provides croak

sub timestamp2regex {
    my $exp = shift;

    # Composite tokens, expanded recursively into simpler strftime formats.
    my %metareplacements = (
        'D' => '%m/%d/%y',
        'F' => '%Y-%m-%d',
        'r' => '%I:%M:%S %p',
        'R' => '%H:%M',
        'T' => '%H:%M:%S'
    );

    # Simple tokens and the regex fragment that matches each one.
    my %replacements = (
        'a' => '[[:alpha:]]+',
        'A' => '[[:alpha:]]+',
        'b' => '[[:alpha:]]+',
        'B' => '[[:alpha:]]+',
        'd' => '\d{2}',
        'e' => '[\d\s]\d',
        'g' => '\d{2}',
        'G' => '\d{4}',
        'h' => '[[:alpha:]]+',
        'H' => '\d{2}',
        'I' => '\d{2}',
        'j' => '\d{3}',
        'k' => '[\d\s]\d',
        'l' => '[\d\s]\d',
        'm' => '\d{2}',
        'M' => '\d{2}',
        'p' => '[A-Za-z.]{2,}',
        'P' => '[A-Za-z.]{2,}',
        's' => '\d+',
        'S' => '\d{2}',
        't' => '\t',
        'u' => '\d',
        'U' => '\d{2}',
        'V' => '\d{2}',
        'w' => '\d',
        'W' => '\d{2}',
        'y' => '\d{2}',
        'Y' => '\d{4}',
        'z' => '[+-]\d{4}',
        'Z' => '[[:alpha:]]*',
        '%' => '\%'
    );

    # Escape the literal text first; each %X token is then seen as \%X
    # (or \%\X when X is itself a non-word character).
    $exp = quotemeta($exp);
    $exp =~ s/\\\%\\?(.)/
        if (defined $metareplacements{$1}) {
            timestamp2regex($metareplacements{$1});
        } elsif (defined $replacements{$1}) {
            $replacements{$1};
        } else {
            croak "Unsupported or unrecognized timestamp format token: \%$1.";
        }/eg;

    return $exp;
}
(This turned out to be much easier to write than I expected, once I gave up on locales.)
This isn't ideal, because it doesn't accept anything locale-related (it will croak on %c, %E, %O, %x, and %X), but I don't think those are actually going to be used. After writing this, I discovered Regexp::Common::time, which appears to be exactly what I was after (and does roughly what this code does). It's much longer than my code, though, and I'm not sure it handles certain things (like non-English AM/PM) as well as mine does. If I run into any locale issues with mine, I'll probably switch to that.
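For anyone who wants to try the idea, here is a rough usage sketch of stripping a leading timestamp from log lines with the generated pattern. The format string and the log lines are made up for illustration, and the sub is an abbreviated stand-in that only knows the six tokens used here, so that the snippet runs on its own; the full table from the sub above would drop in the same way.

```perl
use strict;
use warnings;
use Carp;

# Abbreviated stand-in for timestamp2regex: same approach, but only the
# tokens needed for this example are listed.
sub timestamp2regex {
    my $exp = shift;
    my %replacements = (
        'Y' => '\d{4}', 'm' => '\d{2}', 'd' => '\d{2}',
        'H' => '\d{2}', 'M' => '\d{2}', 'S' => '\d{2}',
    );
    # Escape literal text, then swap each escaped %X token for its pattern.
    $exp = quotemeta($exp);
    $exp =~ s{\\\%\\?(.)}{
        defined $replacements{$1}
            ? $replacements{$1}
            : croak "Unsupported or unrecognized timestamp format token: \%$1."
    }eg;
    return $exp;
}

# Hypothetical format and log lines, for illustration only.
my $format = '[%Y-%m-%d %H:%M:%S]';
my $regex  = timestamp2regex($format);

# Anchor at the start of the line and swallow the whitespace that follows.
my $ts_re = qr/^$regex\s*/;

my @lines = (
    '[2011-03-04 05:06:07] server started',
    '[2011-03-04 05:06:09] connection from 10.0.0.1',
);

for my $line (@lines) {
    (my $stripped = $line) =~ s/$ts_re//;
    print "$stripped\n";
}
```

Compiling the pattern once with qr// and anchoring it with ^ keeps it from accidentally matching something date-like later in the line; whether that anchoring is right depends on where the timestamp sits in your particular log format.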