sorting logfiles by timestamp

jasonl has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
Re: sorting logfiles by timestamp by Preceptor (Deacon) on Jan 19, 2014 at 19:10 UTC
Date::Parse is your friend for this sort of operation. Because you're using US-format dates, you can't do a simple (numeric or stringwise) sort. By preference, I'd say use the ISO 8601 standard date format, e.g. YYYY-MM-DD HH:MM:SS, because then this problem is trivial: it sorts both numerically and stringwise. I realise that's not always an option, though, so instead:

`use Date::Parse;
my %sort_hash;
my $line = "01/14/2014 23:44:12 <data1> <data2>";
my ( $datestr, $timestr, @rest_of_string ) = split ( /\s+/, $line );
my $unix_time = str2time ( $datestr . " " . $timestr );
print $unix_time,"\n";
$sort_hash{$unix_time} = join ( " ", @rest_of_string );`

I'm sure you can adapt this for a 'while' loop easily enough.
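A minimal sketch of that loop adaptation, under stated assumptions. The core module Time::Local (`timegm`) stands in for Date::Parse's `str2time` so that nothing from CPAN is needed, and the timestamps are treated as UTC. The names `@lines`, `%by_time` and `@sorted` are illustrative, and the sample records are borrowed from later in the thread. Each hash value is an array, because keying a plain hash on the epoch alone would let two records from the same second overwrite each other.

```perl
use strict;
use warnings;
use Time::Local qw(timegm);    # core module; swapped in for Date::Parse here

# Hypothetical sample input. A real script would read lines from a filehandle
# with something like: while (my $line = <$fh>) { ... }
my @lines = (
    "01/14/2014 23:44:14 A Y",
    "01/14/2014 23:44:12 B Y",
    "01/14/2014 23:44:13 C X",
    "01/14/2014 23:44:12 D X",
);

my %by_time;    # epoch seconds => array of original lines
for my $line (@lines) {
    my ($mon, $day, $year, $hour, $min, $sec) =
        $line =~ m{^(\d\d)/(\d\d)/(\d{4}) (\d\d):(\d\d):(\d\d)\s}
        or next;    # skip lines that do not start with a timestamp

    # timegm expects a zero-based month.
    my $epoch = timegm($sec, $min, $hour, $day, $mon - 1, $year);
    push @{ $by_time{$epoch} }, $line;    # keep every line for that second
}

# Numeric sort on the epoch keys, then expand each bucket back into lines.
my @sorted = map { @{ $by_time{$_} } } sort { $a <=> $b } keys %by_time;
print "$_\n" for @sorted;
```

Records that share a second come out in the order they were read.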
Re: sorting logfiles by timestamp by Jim (Curate) on Jan 19, 2014 at 21:35 UTC
"I'm splitting the line and using data2 as the key for a hash…"

You can do this because each log record's `<data2>` is guaranteed to be unique and is therefore a viable key, right?

"…and then pushing date, time, and data1 into an array that is the value…"

So to amplify Laurent_R's fine suggestion, you're already including in the hash values (i.e., the stored data) the timestamps that will serve as proper sort keys and that you'll therefore use to sort the records later with a Guttman Rosler Transform. You just need to ensure the sort key timestamps are in an ISO 8601 format instead of the format used in the logs. This ensures that when you sort the timestamps lexicographically (ASCIIbetically), they're ordered chronologically as well.

`# Parse the log record...
m{^(\d\d)/(\d\d)/(\d\d\d\d) (\d\d:\d\d:\d\d) (\S+) (\S+)} or die;
my $timestamp = "$3-$1-$2 $4";
my $data1 = $5;
my $data2 = $6;
my %myHash;
push @{ $myHash{$data2}{'info'} }, "$timestamp,$data1";`

And since it appears you intend to keep the sort key timestamps as data, you don't have to lop them off as you normally would in a Guttman Rosler Transform. So you won't really need to use a transform, per se, at all. You can just sort the records by their hash values.

Jim
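A minimal sketch of where this leads, assuming the goal is a single list ordered by time across all `<data2>` keys. The names `@lines`, `@records` and `@sorted` are illustrative, and the sample records are the ones used elsewhere in the thread. Because every record starts with an ISO 8601 timestamp, a plain string `sort` with no comparison block orders them chronologically.

```perl
use strict;
use warnings;

# Hypothetical sample input.
my @lines = (
    "01/14/2014 23:44:14 A Y",
    "01/14/2014 23:44:12 B Y",
    "01/14/2014 23:44:13 C X",
    "01/14/2014 23:44:12 D X",
);

my %myHash;
for my $line (@lines) {
    # Parse the log record and rebuild the timestamp as YYYY-MM-DD HH:MM:SS.
    $line =~ m{^(\d\d)/(\d\d)/(\d\d\d\d) (\d\d:\d\d:\d\d) (\S+) (\S+)} or next;
    my ($timestamp, $data1, $data2) = ("$3-$1-$2 $4", $5, $6);
    push @{ $myHash{$data2}{info} }, "$timestamp,$data1";
}

# Flatten the hash into "timestamp,data1,data2" strings.
my @records;
for my $data2 (keys %myHash) {
    for my $entry (@{ $myHash{$data2}{info} }) {
        push @records, "$entry,$data2";
    }
}

# Default (string) sort: ISO 8601 timestamps sort chronologically.
my @sorted = sort @records;
print "$_\n" for @sorted;
```

Ties within the same second are broken by the rest of the string, which matches the original poster's statement that such records may appear in any order.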
Re: sorting logfiles by timestamp by Laurent_R (Canon) on Jan 19, 2014 at 20:18 UTC
This is really a typical case for a Schwartzian Transform or, even better, a Guttman Rosler Transform (see "Advanced Sorting - GRT - Guttman Rosler Transform"). In brief: prepend the date in YYYYMMDDHHMMSS format to each record, sort on it, and remove it again when you output the result.
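The three steps just described can be sketched as follows. This is an illustration under assumptions, not code from the thread: `@lines` and `@sorted` are made-up names, and every line is assumed to start with an `MM/DD/YYYY HH:MM:SS` timestamp.

```perl
use strict;
use warnings;

# Hypothetical sample input.
my @lines = (
    "01/14/2014 23:44:14 A Y",
    "01/14/2014 23:44:12 B Y",
    "01/14/2014 23:44:13 C X",
    "01/14/2014 23:44:12 D X",
);

# Read the pipeline bottom-up: build keys, sort, strip keys.
my @sorted =
    map { substr($_, 14) }    # 3. drop the 14-character YYYYMMDDHHMMSS prefix
    sort                      # 2. plain string sort, no comparison block
    map {                     # 1. prepend the fixed-width sort key
        my ($mon, $day, $year, $h, $m, $s) =
            m{^(\d\d)/(\d\d)/(\d{4}) (\d\d):(\d\d):(\d\d)};
        "$year$mon$day$h$m$s$_";
    } @lines;

print "$_\n" for @sorted;
```

The key is fixed-width, so it can be removed with a single `substr`, and the sort itself runs with Perl's default string comparison rather than a custom block.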
Re: sorting logfiles by timestamp by kcott (Archbishop) on Jan 20, 2014 at 16:40 UTC
G'day jasonl,

You show no indication of what `<data2>` contains. I assume its values are not unique, as you've used it as a key for an array (`@{$myHash{$data2}{info}}`). You don't say how many elements this array might hold, whether you want to sort on `<data2>`, or whether `<data2>` needs to appear in the output. A representative, unordered sample of the input, as well as how you'd expect that to be output, would have been helpful.

The following script may provide some help in formulating your solution:

`#!/usr/bin/env perl -l
use strict;
use warnings;
use Time::Piece;

my %myHash;
my $format = '%m/%d/%Y %H:%M:%S';

while (<DATA>) {
    my ($date, $time, $data1, $data2) = split;
    my $sort_key = Time::Piece->strptime("$date $time", $format)->epoch;
    push @{$myHash{$data2}{info}}, "$sort_key:$date,$time,$data1";
}

for my $key (sort keys %myHash) {
    print "$_,$key" for
        map { $_->[1] }
        sort { $a->[0] <=> $b->[0] }
        map { [ split /:/, $myHash{$key}{info}[$_], 2 ] }
        0 .. $#{$myHash{$key}{info}};
}

__DATA__
01/14/2014 23:44:14 A Y
01/14/2014 23:44:12 B Y
01/14/2014 23:44:13 C X
01/14/2014 23:44:12 D X`

Output:

`01/14/2014,23:44:12,D,X
01/14/2014,23:44:13,C,X
01/14/2014,23:44:12,B,Y
01/14/2014,23:44:14,A,Y`

If you provide a better description of your problem, a better solution can probably be provided. The guidelines in "How do I post a question effectively?" may help you with this.

-- Ken
Re^2: sorting logfiles by timestamp by jasonl (Acolyte) on Jan 22, 2014 at 23:28 UTC
OK, I finally got a chance to try this. It works a treat, but I cannot for the life of me wrap my head around how and why, and I hate using code I can't understand (for several reasons). I'm OK until:

`print "$_,$key" for
    map { $_->[1] }
    sort { $a->[0] <=> $b->[0] }
    map { [ split /:/, $myHash{$key}{info}[$_], 2 ] }
    0 .. $#{$myHash{$key}{info}};`

Is there a way to write that as a more C-style for loop, even if it's pseudo-code, or is that the only way it will work? I don't follow the flow as-is, and my attempt to rewrite it ended up printing only the indices of the array. I'm also having trouble following the map { } statements, but hopefully if I can grok the way the loop is working the rest will start to make a little more sense.

Thanks again.
Re^3: sorting logfiles by timestamp by kcott (Archbishop) on Jan 23, 2014 at 14:43 UTC
"It works a treat, but I cannot for the life of me wrap my head around how and why, and I hate using code I can't understand (for several reasons)." Firstly, I'm glad to hear you're not just blindly plugging in code you don't understand. The "`map {} sort {} map {}`" construct is known as the "Schwartzian Transform" (to which Laurent_R referred earlier in this thread). Where you encounter combinations of functions which take a list and return a list (e.g. grep, map, sort, etc.), it's often best to evaluate them in reverse order. Consider this (clunky) rewrite of that piece of code: `my @indices_of_myhash_key_info_array = 0 .. $#{$myHash{$key}{info}}; my @two_element_arrayrefs_with_sortkey_and_data = map { [ split /:/, $myHash{$key}{info}[$_], 2 ] } @indices_of_myhash_key_info_array; my @two_element_arrayrefs_sorted_by_sortkey = sort { $a->[0] <=> $b->[0] } @two_element_arrayrefs_with_sortkey_and_data; my @data_element_only_sorted_by_sortkey = map { $_->[1] } @two_element_arrayrefs_sorted_by_sortkey; for (@data_element_only_sorted_by_sortkey) { print "$_,$key"; }` [download] Hopefully that explains what is going on but feel free to ask if anything needs further explanation. -- Ken	[reply] [d/l] [select]
Re^2: sorting logfiles by timestamp by jasonl (Acolyte) on Jan 21, 2014 at 16:15 UTC
Thanks, all. I haven't had time to fully absorb the different transforms, but that sounds helpful.

kcott, you've pretty much hit the nail on the head, except your output isn't sorted the way I need it to be. You've got:

`01/14/2014,23:44:12,D,X
01/14/2014,23:44:13,C,X
01/14/2014,23:44:12,B,Y
01/14/2014,23:44:14,A,Y`

When it needs to be:

`01/14/2014,23:44:12,D,X
01/14/2014,23:44:12,B,Y
01/14/2014,23:44:13,C,X
01/14/2014,23:44:14,A,Y`

Is that the expected behavior? (Note: for my needs, in cases where there are multiple entries in the same second, they can be in any order.)
Re^3: sorting logfiles by timestamp by kcott (Archbishop) on Jan 21, 2014 at 18:29 UTC
From the information provided, I can see no need for that complex data structure (i.e. `@{$myHash{$data2}{info}}`). This code produces the output you say you want:

`#!/usr/bin/env perl -l
use strict;
use warnings;
use Time::Piece;

my @data;

while (<DATA>) {
    my ($date, $time, $data1, $data2) = split;
    my $key = Time::Piece->strptime("$date $time", '%m/%d/%Y %H:%M:%S')->epoch;
    push @data, [$key, "$date,$time,$data1,$data2"];
}

print for map { $_->[1] } sort { $a->[0] <=> $b->[0] } @data;

__DATA__
01/14/2014 23:44:14 A Y
01/14/2014 23:44:12 B Y
01/14/2014 23:44:13 C X
01/14/2014 23:44:12 D X`

Output:

`01/14/2014,23:44:12,B,Y
01/14/2014,23:44:12,D,X
01/14/2014,23:44:13,C,X
01/14/2014,23:44:14,A,Y`

If that doesn't do exactly what you want, it should at least provide sufficient information for you to attempt a solution yourself. If you do need further help, please ensure you post the missing details.

-- Ken
Re^4: sorting logfiles by timestamp by Jim (Curate) on Jan 21, 2014 at 22:57 UTC
Re^5: sorting logfiles by timestamp by kcott (Archbishop) on Jan 22, 2014 at 14:37 UTC
Re^4: sorting logfiles by timestamp by Anonymous Monk on Jan 22, 2014 at 14:05 UTC