ImJustAFriend has asked for the wisdom of the Perl Monks concerning the following question:

Hello again Monks. I have a task that on the surface seems reasonable (I've done it before with different data). I need to merge 2 or more logfiles into one cohesive log and sort it by date/time. The issue I am running into is that the date format in the log files is awkward. The dates look like:
Tue Oct 16 17:48:03 2018
I am sorting with sort and cmp, but it fails: the entries from the 2 files don't interleave. In my merged output (simplified, only showing the date/time), I see:
[file1] TIMESTAMP=Wed Oct 5 04:08:28 2018...
...
[file1] TIMESTAMP=Fri Nov 2 14:11:28 2018...
[file2] TIMESTAMP=Tue Oct 16 17:10:00 2018...
...
[file2] TIMESTAMP=Fri Nov 2 14:11:03 2018...
EDIT EDIT EDIT EDIT EDIT
I left the formatting of the log files out, in case it helps:
<13>{Mangled Date/Time} <source>[pid]: TIMESTAMP={I'm using this timestamp} MSGCLS= Title= Severity= message = <message part a> <message part b> ... Message Id= END OF REPORT
The server/script sees each of these fields as an individual line, with a newline between them...
EDIT EDIT EDIT EDIT EDIT My script looks like this:
foreach my $log (@CommonLogFiles) {
    print "Now Processing $log....\n";
    open LOG, "$datadir/$log" or die "LOG $log: $!\n";
    print "Just opened $datadir/$log...\n";
    $linecnt = 0;
    while (<LOG>) {
        chomp $_;
        $linecnt++;
        if ( $linecnt == 1 ) {
            $line .= "[$log] ";
        }
        $line .= $_;
        push @lines, "$line\n";
        #print "Just added " . $line . " to the lines array...\n";
        $line    = "";
        $linecnt = 0;
    }
    close LOG;
}
print APPLOG "Files parsed and data inserted into array for further processing...\n";
@sortedlines = sort {
    ($a =~ /^\<\d+\>\w+\s+\d+:\d+:\d+\s+\w+:TIMESTAMP\=(\w+\s+\w+\s+\d+\s+\d+:\d+:\d+\s+\d+).*MSGCLS.*$/m)[0]
        cmp
    ($b =~ /^\<\d+\>\w+\s+\d+:\d+:\d+\s+\w+:TIMESTAMP\=(\w+\s+\w+\s+\d+\s+\d+:\d+:\d+\s+\d+).*MSGCLS.*$/m)[0]
} @lines;
print SORTED "@sortedlines\n";
print APPLOG "Log files merged together with events in time order...\n";
close SORTED;
Anyone have any ideas? I would appreciate the help!!

Replies are listed 'Best First'.
Re: Merge 2 or more logs, sort by date & time
by haukex (Archbishop) on Nov 02, 2018 at 16:30 UTC

    You can use Time::Piece or DateTime::Format::Strptime to parse the dates into an object (the advantage of DateTime being that it has better time zone support). If the files are small enough to fit into memory, you can use sort and a Schwartzian transform to sort them.

    use warnings;
    use strict;
    use Time::Piece;
    use List::Util qw/shuffle/;

    my @input = shuffle(
        "Wed Oct 5 04:08:28 2018",
        "Fri Nov 2 14:11:28 2018",
        "Tue Oct 16 17:10:00 2018",
        "Fri Nov 2 14:11:03 2018",
    );
    my @output =
        map  { $_->[0] }
        sort { $a->[1] <=> $b->[1] }
        map  { [ $_, Time::Piece->strptime($_, '%a %b %d %H:%M:%S %Y')->epoch ] }
        @input;
    print "$_\n" for @output;

    __END__
    Wed Oct 5 04:08:28 2018
    Tue Oct 16 17:10:00 2018
    Fri Nov 2 14:11:03 2018
    Fri Nov 2 14:11:28 2018
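    For comparison, a minimal sketch of the DateTime::Format::Strptime alternative mentioned above; the UTC time zone is an assumption, since the logs show no zone information:

    use warnings;
    use strict;
    use DateTime::Format::Strptime;

    # Build one parser up front and reuse it for every line.
    my $strp = DateTime::Format::Strptime->new(
        pattern   => '%a %b %d %H:%M:%S %Y',
        time_zone => 'UTC',       # assumption: the logs carry no zone info
        on_error  => 'croak',
    );

    my $dt = $strp->parse_datetime('Tue Oct 16 17:48:03 2018');
    print $dt->epoch, "\n";       # numeric value you can sort with <=>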
Re: Merge 2 or more logs, sort by date & time
by bliako (Abbot) on Nov 02, 2018 at 16:54 UTC

    You delegate the comparison logic to cmp but cmp is not psychic.

    Your dates seem to be parsable by the DateTime::Format::x509 module (if you omit the day-name). So first parse the date of each log-line and create a DateTime object from it, like so (straight from the manual):

    use DateTime::Format::x509;
    my $f  = DateTime::Format::x509->new();
    my $dt = $f->parse_datetime('Mar 11 03:05:34 2013 UTC');

    If your log files all come from a single timezone, then just append the same timezone string (e.g. UTC) to the date when calling parse_datetime() (I do not see timezone information in your logs).

    First create a parser obj before any date parsing:

    my $dtparser = DateTime::Format::x509->new();

    Then, in the loop that reads the log-lines, I would change push @lines, "$line\n"; to:

    $line =~ /^\<\d+\>\w+\s+\d+:\d+:\d+\s+\w+:TIMESTAMP\=\w+\s+(\w+\s+\d+\s+\d+:\d+:\d+\s+\d+).*MSGCLS.*$/m;
    $datestr = $1; # I use your regex but with the day-name removed from the bracket, e.g. Oct 5 04:08:28 2018
    my $dtobj = $dtparser->parse_datetime($datestr);
    my $epoch = $dtobj->epoch();
    push @lines, [$line, $epoch]; # Line AND epoch saved into the array, maybe dtobj too for completeness

    Now you can compare the epochs numerically:

    @sortedlines = sort { $a->[1] <=> $b->[1] } @lines;

    Bonus: note that the way you do the comparisons, a regex has to be applied to each of the two lines every time they are compared, just to extract the date-string. That means comparing Line1 with Line2, Line1 with Line3 and Line2 with Line3 applies the regex 6 times, whereas 3 times would have been sufficient. That's why it is best to transform the data into the actual comparison format (e.g. the date-string or epoch) once, before running sort. You save processing time at the expense of some space.
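    A tiny illustrative sketch of that point; the data is made up purely to count how often the extraction regex runs under each approach:

    use strict;
    use warnings;

    my @lines = map { "TIMESTAMP=2018-$_ some message" } 100 .. 199;

    # Comparator does the extraction: one regex per operand, per comparison.
    my $in_comparator = 0;
    my @s1 = sort {
        $in_comparator += 2;
        ($a =~ /TIMESTAMP=(\S+)/)[0] cmp ($b =~ /TIMESTAMP=(\S+)/)[0];
    } @lines;

    # Decorate first: exactly one regex per line, then a plain cmp.
    my $decorated = 0;
    my @s2 = map  { $_->[0] }
             sort { $a->[1] cmp $b->[1] }
             map  { $decorated++; [ $_, (/TIMESTAMP=(\S+)/)[0] ] } @lines;

    print "comparator extractions: $in_comparator, decorate extractions: $decorated\n";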

    Edit: haukex's answer mentions the above more succinctly (and beat me to it by a few minutes).

    bw, bliako

Re: Merge 2 or more logs, sort by date & time
by 1nickt (Canon) on Nov 02, 2018 at 15:50 UTC

    Hi, if the format used for the date in each line is the same, I suggest you parse the dates and sort by comparing them either as epoch values or with a comparator built from the parsed date objects, e.g. using DateTime or the core module Time::Piece.

    Update: Here's a slightly silly example. (It's silly because you shouldn't be using a string sort for this, but rather, offloading the log entries into a database (eg using DBD::SQLite) and doing your sorting there.)

    Parsing the dates each time the sort sub makes a comparison would be grossly inefficient, so I replace them with epoch values first and then revert after the sort. This is not to say that sorting with a sub that uses regular expressions isn't also hugely inefficient ;-)

    use strict;
    use warnings;
    use feature 'say';
    use Time::Piece;
    use Data::Dumper;

    my $fmt = '%a %b %d %T %Y';

    say for
        map {
            s/ (?<=TIMESTAMP=) (\d+) / localtime($1)->strftime($fmt) /arex
        }
        sort {
            ($a =~ /(?<=TIMESTAMP=) (\d+)/ax)[0] <=> ($b =~ /(?<=TIMESTAMP=) (\d+)/ax)[0]
        }
        map {
            chomp;
            s/ (?<=TIMESTAMP=) (\w+\s+\w+\s+\d+\s+\d+:\d+:\d+\s+\d+) / Time::Piece->strptime($1, $fmt)->epoch /arex
        } <DATA>;

    __END__
    ...bla TIMESTAMP=Wed Oct 5 04:08:28 2018 bla...
    ...bla TIMESTAMP=Fri Nov 2 14:11:28 2018 bla...
    ...bla TIMESTAMP=Tue Oct 16 17:10:00 2018 bla...
    ...bla TIMESTAMP=Fri Nov 2 14:11:03 2018 bla...
    Output:
    $ perl 1225106.pl
    ...bla TIMESTAMP=Fri Oct 05 00:08:28 2018 bla...
    ...bla TIMESTAMP=Tue Oct 16 13:10:00 2018 bla...
    ...bla TIMESTAMP=Fri Nov 02 10:11:03 2018 bla...
    ...bla TIMESTAMP=Fri Nov 02 10:11:28 2018 bla...

    Hope this helps!



    The way forward always starts with a minimal test.
      Thanks for the updated code, very helpful! Question - how do I tell it to sort the file? Do I still need to split my input into an array?

        Are there multiple records in one log like this?

        <13>{Mangled Date/Time} <source>[pid]: TIMESTAMP={I'm using this timestamp} MSGCLS= Title= Severity= message = <message part a> <message part b> ... Message Id= END OF REPORT
        <14>{Mangled Date/Time} <source>[pid]: TIMESTAMP=Tue Oct 16 17:10:00 2018 MSGCLS= Title= Severity= message = <message part a> <message part b> ... Message Id= END OF REPORT
        poj
      Thanks, I'll have a look!
Re: Merge 2 or more logs, sort by date & time
by Laurent_R (Canon) on Nov 02, 2018 at 22:43 UTC
    Hi ImJustAFriend,

    presumably, your individual log files are already sorted in chronological order.

    In this case (and assuming you have a lot of data), merging files that are already sorted by date is usually much faster than sorting a concatenation of all the records. This idea is the basic principle of the merge sort algorithm, one of the best sorting algorithms available.

    Basically, assuming you have only two input files, you read them in parallel: at each step you compare the next record from each file, output whichever comes first, and advance that file. Each record is processed only once, and this is way faster than any standard sort algorithm.

    If you have more than two input files, it becomes slightly more complicated, but it is not so complex to merge files two by two until they have all been used.
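    A minimal sketch of the two-file merge described above, assuming each record is a single line carrying a TIMESTAMP= field in the format shown in the question and that both files are already in chronological order (the file names and the record_epoch helper are illustrative, not from the original post):

    use strict;
    use warnings;
    use Time::Piece;

    my $fmt = '%a %b %d %H:%M:%S %Y';

    # Illustrative helper: pull the TIMESTAMP= date out of a record and
    # turn it into an epoch number (assumes every record has one).
    sub record_epoch {
        my ($rec) = @_;
        $rec =~ /TIMESTAMP=(\w+\s+\w+\s+\d+\s+\d+:\d+:\d+\s+\d+)/ or die "no timestamp in: $rec";
        return Time::Piece->strptime($1, $fmt)->epoch;
    }

    open my $fh1, '<', 'file1.log'  or die $!;
    open my $fh2, '<', 'file2.log'  or die $!;
    open my $out, '>', 'merged.log' or die $!;

    my $rec1 = <$fh1>;
    my $rec2 = <$fh2>;
    while ( defined $rec1 and defined $rec2 ) {
        if ( record_epoch($rec1) <= record_epoch($rec2) ) {
            print $out $rec1;
            $rec1 = <$fh1>;
        }
        else {
            print $out $rec2;
            $rec2 = <$fh2>;
        }
    }
    # One of the files is exhausted; drain whatever is left of the other.
    print $out $rec1, <$fh1> if defined $rec1;
    print $out $rec2, <$fh2> if defined $rec2;

    For more than two files the same idea generalizes to a k-way merge: keep one pending record per file and always emit the one with the smallest epoch.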

Re: Merge 2 or more logs, sort by date & time
by ImJustAFriend (Scribe) on Nov 02, 2018 at 17:45 UTC
    Thank you everyone for the help so far. Unfortunately, it's not working properly. Here is my code now:
    my $log = $provoutfile;
    print "Now Processing $log....\n";
    open LOG, "$log" or die "LOG $log: $!\n";
    $linecnt = 0;
    while (<LOG>) {
        chomp $_;
        $linecnt++;
        my @sortedlines =
            map  { $_->[0] }
            sort { $a->[1] <=> $b->[1] }
            map  { [ $_, Time::Piece->strptime($_, '%a %b %d %H:%M:%S %Y')->epoch ] }
            @lines;
        $line = "";
        $linecnt = 0;
    }
    It doesn't fail or error or anything... it just doesn't sort properly. When I grep TIMESTAMP out of my output file, the last dozen-ish lines look like this:
    TIMESTAMP=Fri Nov 2 16:33:18 2018
    TIMESTAMP=Fri Nov 2 17:26:35 2018
    TIMESTAMP=Fri Nov 2 17:26:35 2018
    TIMESTAMP=Fri Nov 2 17:27:33 2018
    TIMESTAMP=Fri Nov 2 17:27:33 2018
    TIMESTAMP=Thu Oct 18 18:29:15 2018
    TIMESTAMP=Thu Oct 18 18:29:15 2018
    TIMESTAMP=Thu Oct 18 18:29:27 2018
    TIMESTAMP=Thu Oct 18 18:29:27 2018
    TIMESTAMP=Thu Oct 18 19:06:32 2018
    TIMESTAMP=Thu Oct 18 19:06:32 2018
    Maybe it's because each individual log entry is broken up by newlines?
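    That is very likely the cause: sorting line by line tears each multi-line report apart. A minimal sketch of one way to group whole records before sorting; the END OF REPORT terminator comes from the format described above, while the input file name and the rest of the handling are assumptions:

    use strict;
    use warnings;
    use Time::Piece;

    my $fmt = '%a %b %d %H:%M:%S %Y';
    my @records;
    my $current = '';

    open my $in, '<', 'merged_input.log' or die $!;   # placeholder file name
    while ( my $line = <$in> ) {
        $current .= $line;
        if ( $line =~ /END OF REPORT/ ) {             # one whole report collected
            push @records, $current;
            $current = '';
        }
    }
    close $in;

    # Sort whole records, not individual lines, by the TIMESTAMP= field.
    my @sorted =
        map  { $_->[0] }
        sort { $a->[1] <=> $b->[1] }
        map  {
            my ($ts) = /TIMESTAMP=(\w+\s+\w+\s+\d+\s+\d+:\d+:\d+\s+\d+)/;
            [ $_, Time::Piece->strptime($ts, $fmt)->epoch ];
        } @records;

    print @sorted;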
Re: Merge 2 or more logs, sort by date & time
by ImJustAFriend (Scribe) on Nov 02, 2018 at 18:38 UTC
    Thanks one and all for your generous help! I was able to adapt poj's answer to work. Thanks again!!