mr. jaggers has asked for the wisdom of the Perl Monks concerning the following question:

Monks,

Many months ago, a workhorse web server here at work got split open like a sardine can by a clever Baltic hacker. At any rate, backups were pulled from a few days before the box was dropped.

Apache log files from the now-old backups got dropped onto the new web server without merging in the few days of log accumulation. Naturally, most of the clients who were affected wanted that log activity merged back in for stats, etc. ... (insert sound of other shoe dropping)... and now it's *my* problem.

I attempted to use sdiff to extract the missing entries, but each version of the log contains lines the other lacks; what I need is the union of all unique lines from both files.

So I used the mergelog tool, only to find out that it allows duplicates. So I ran "uniq" on it, only to find that mergelog interleaves chunks of log lines that share a timestamp: two lines with stamp "[01/Jan/2004:00:07:55..." go in back-to-back, followed by the same pair of log lines from the other log file. Unless the merged log file is sorted by more than just the timestamp, uniq can't help.
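
For the duplicates by themselves, a %seen hash wouldn't care whether the dupes are adjacent the way uniq does; a minimal sketch, at the cost of holding one hash entry per unique line in memory:

    #!/usr/bin/perl
    # Drop duplicate lines wherever they occur -- unlike uniq, this does
    # not need the duplicates to be adjacent.
    use strict;
    use warnings;

    my %seen;
    while (my $line = <>) {
        print $line unless $seen{$line}++;
    }

Run as something like perl dedupe.pl merged_log > clean_log (names made up); it still leaves the interleaved ordering untouched, which is the other half of the problem.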

Searching PM, I find this discussion between monks (which, I might add, has a valuable and satisfying suggestion to kick the fellow that passed this off on me) which offers two short, untested, and not-entirely-applicable sort routines. Ugh.

One idea there, implementing a radix sort because of the sheer size of the log files, is interesting, and right now I'm torn between that and some version of a Schwartzian Transform; but I think sorting is probably going to be the wrong way to go.
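
For what it's worth, a Schwartzian Transform keyed on the timestamps would look roughly like the sketch below; it assumes the usual "[dd/Mon/yyyy:hh:mm:ss zone]" stamp and slurps the whole file into memory, so treat it as a sketch rather than something tuned for 200MB inputs:

    #!/usr/bin/perl
    # Schwartzian Transform: decorate each log line with a sortable
    # yyyymmddhhmmss key, sort on the key, then strip the key again.
    use strict;
    use warnings;

    my %mon;
    @mon{qw(Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec)} = (1 .. 12);

    sub ts_key {
        my ($line) = @_;
        $line =~ m{\[(\d{2})/(\w{3})/(\d{4}):(\d{2}):(\d{2}):(\d{2})}
            or return '';    # unparsable lines sort to the front
        return sprintf '%04d%02d%02d%02d%02d%02d',
                       $3, $mon{$2}, $1, $4, $5, $6;
    }

    print map  { $_->[1] }
          sort { $a->[0] cmp $b->[0] }
          map  { [ ts_key($_), $_ ] }
          <>;

It also runs straight into the ordering issue discussed further down: unless the sort is stable (Perl 5.8's mergesort is, and use sort 'stable'; makes that explicit), lines sharing a second can come out reordered.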

I've already hand-merged the smaller virtual hosts, but some of these guys have 100-200 MB monthly log files, which is (at this point) far more than I can do by hand. It has to be done algorithmically. So, basically, this problem just sucks.

Anyone got an idea how to do this for log files that run to many hundreds of megabytes? I've already had two people tell me that Perl, being interpreted, is too slow for a solution. Well, I've had excellent experiences munging multi-gigabyte data files with Perl very rapidly, but I'm not willing to rule anything out. Any creative ideas somewhere between diff and patch? This should be a simple problem, but there doesn't seem to be enough coffee in my office to fuel a solution...


•Re: Big hairy ugly log sorting merging problem
by merlyn (Sage) on Aug 07, 2004 at 01:16 UTC

      Ok, good idea. I wasn't thinking along the lines of columns, but that's exactly what they are... plus MySQL and DBI et al. are already on the log-file-containing machine.

      It's no Postgres, but it should do for this, I think.

      I think the real sorting problem is the resolution of the timestamp in our combined log format. Does anyone (who cares) happen to know whether Apache will do finer-grained log timestamping than single seconds?
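
      As a rough illustration of the columns idea, and only under made-up assumptions (table layout, DSN, credentials; INSERT IGNORE is MySQL-specific): load every line of both copies of a log, let a unique key over an MD5 of the raw line discard duplicates, then pull everything back out ordered by timestamp.

          #!/usr/bin/perl
          # Load Apache combined-log lines into MySQL; a unique key on an
          # MD5 of the raw line makes the database drop duplicates for us.
          use strict;
          use warnings;
          use DBI;
          use Digest::MD5 qw(md5_hex);

          # Made-up DSN and credentials.
          my $dbh = DBI->connect('dbi:mysql:database=weblogs', 'loguser', 'secret',
                                 { RaiseError => 1, AutoCommit => 0 });

          $dbh->do(q{
              CREATE TABLE IF NOT EXISTS access_log (
                  line_md5 CHAR(32) NOT NULL,
                  stamp    DATETIME NOT NULL,
                  line     TEXT     NOT NULL,
                  UNIQUE KEY (line_md5)
              )
          });

          my %mon;
          @mon{qw(Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec)} = (1 .. 12);

          # INSERT IGNORE (MySQL-specific) silently skips rows that would
          # violate the unique key, i.e. lines we have already loaded.
          my $ins = $dbh->prepare(
              'INSERT IGNORE INTO access_log (line_md5, stamp, line) VALUES (?, ?, ?)'
          );

          while (my $line = <>) {
              chomp $line;
              $line =~ m{\[(\d{2})/(\w{3})/(\d{4}):(\d{2}:\d{2}:\d{2})} or next;
              my $stamp = sprintf '%04d-%02d-%02d %s', $3, $mon{$2}, $1, $4;
              $ins->execute(md5_hex($line), $stamp, $line);
          }
          $dbh->commit;

      The catch is exactly the resolution question above: SELECT line FROM access_log ORDER BY stamp returns rows that share a second in no particular order, unless something like an auto-increment column is added and the files are loaded one after the other.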

Re: Big hairy ugly log sorting merging problem
by Aristotle (Chancellor) on Aug 07, 2004 at 01:21 UTC

    The individual log files by themselves are sorted, right? That's a classic merge sort situation.

    1. Set up a buffer for one element per input log
    2. Pull a line from each log into its buffer
    3. Compare the sort keys of all non-empty buffers
    4. Flush the buffer with the smallest key to the target file
    5. Refill that buffer from its file if there's more data in the file
    6. Repeat step 3 onward if there are non-empty buffers

    You only need enough memory for as many lines as you have input files.

    Update: forgot to answer the point about dupes, d'oh. If you're careful about which buffer to pick when there are ties in step 4, you can cluster dupes at that point. In your case, since the individual files will not contain dupes, but entries might be duplicated across files, you want to favour the buffer that was flushed the longest ago. That way, you will step through the files in sync whenever you're in sections containing identical data.
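
    A minimal Perl sketch of that procedure, assuming the usual combined-log "[dd/Mon/yyyy:hh:mm:ss" stamp and that duplicated entries are byte-for-byte identical lines:

        #!/usr/bin/perl
        # Merge any number of already-sorted Apache logs, dropping lines
        # that are duplicated across the inputs.
        use strict;
        use warnings;

        my %mon;
        @mon{qw(Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec)} = (1 .. 12);

        sub ts_key {    # "[dd/Mon/yyyy:hh:mm:ss" -> sortable "yyyymmddhhmmss"
            my ($line) = @_;
            $line =~ m{\[(\d{2})/(\w{3})/(\d{4}):(\d{2}):(\d{2}):(\d{2})}
                or return '';
            return sprintf '%04d%02d%02d%02d%02d%02d',
                           $3, $mon{$2}, $1, $4, $5, $6;
        }

        # Step 1: one buffer per input log.
        my @logs = map {
            open my $fh, '<', $_ or die "open $_: $!";
            +{ fh => $fh, line => undef, key => undef, flushed => 0 };
        } @ARGV;

        # Steps 2 and 5: (re)fill a buffer from its file.
        sub refill {
            my ($log) = @_;
            if (defined(my $line = readline $log->{fh})) {
                $log->{line} = $line;
                $log->{key}  = ts_key($line);
            } else {
                $log->{line} = undef;    # file exhausted, buffer stays empty
            }
        }
        refill($_) for @logs;

        my $tick         = 0;
        my $last_written = '';
        while (1) {
            # Step 3: smallest key wins; on ties, prefer the buffer flushed
            # longest ago so identical sections of the inputs stay in lockstep.
            my ($best) = sort {
                $a->{key} cmp $b->{key} || $a->{flushed} <=> $b->{flushed}
            } grep { defined $_->{line} } @logs;
            last unless $best;    # step 6: stop when every buffer is empty

            # Step 4: flush it, skipping exact duplicates of the previous line.
            print $best->{line} unless $best->{line} eq $last_written;
            $last_written    = $best->{line};
            $best->{flushed} = ++$tick;

            refill($best);
        }

    Invoked as something like perl mergelogs.pl old_access_log new_access_log > merged_access_log (names made up); memory use stays at one buffered line per input file, as described above.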

    Makeshifts last the longest.

      *sigh*

      Yep, I realized not too long after posting that straight sorting would be folly... Log entries that occur during the same second shouldn't necessarily be sorted by anything; they should simply preserve whatever order they had in their original files.

      Upon realizing this, I started on basically the above, hoping that my inherent laziness would be fed by someone else's Apache log merging script ("Here ya go, little guy!").

      If I finish before an entire script magically appears in the comments below, then anyone who wants it can just reply with their intent.

      That way, the saints (like merlyn) can point and laugh, too! :)

      As a side note, wouldn't it have been cool to have mod_perl spin those logs off to the database server, into their own tables? Maybe my boss will pay me to write that...
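
      Purely as a sketch of that idea, not a tested handler: under mod_perl 1, a PerlLogHandler can push one row per request at a database via DBI, roughly like this (package name, DSN, credentials and table are all made up):

          package My::DBLogger;
          # Hypothetical mod_perl 1 log-phase handler: one INSERT per request
          # instead of (or alongside) the flat access_log.
          use strict;
          use Apache::Constants qw(OK);
          use DBI;

          my $dbh;    # one persistent connection per Apache child

          sub handler {
              my $r = shift;
              $dbh ||= DBI->connect('dbi:mysql:database=weblogs;host=dbhost',
                                    'loguser', 'secret', { PrintError => 1 });
              return OK unless $dbh;    # never block the request over logging

              $dbh->do(
                  'INSERT INTO access_log (host, stamp, request, status, bytes)
                   VALUES (?, NOW(), ?, ?, ?)',
                  undef,
                  $r->connection->remote_ip,
                  $r->the_request,
                  $r->status,
                  $r->bytes_sent,
              );
              return OK;
          }

          1;

      Wired up in httpd.conf with PerlModule My::DBLogger and PerlLogHandler My::DBLogger; whether the database keeps up with the hit rate is another matter, as noted below.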

        Actually, there are modules (in the Apache as well as Perl sense of the word) to dump logs to a database instead of a file.

        Depending on the amount of traffic you get, using them may or may not be wise.

        Makeshifts last the longest.