in reply to Big hairy ugly log sorting merging problem

The individual log files by themselves are sorted, right? That's a classic N-way merge situation.

  1. Set up a buffer for one element per input log
  2. Pull a line from each log into its buffer
  3. Compare the sort keys of all non-empty buffers
  4. Flush the buffer with the smallest key to the target file
  5. Refill that buffer from its file if there's more data in the file
  6. Repeat from step 3 as long as any buffer is non-empty

You only need enough memory for as many lines as you have input files.

Update: forgot to answer the point about dupes, d'oh. If you're careful about which buffer to pick when there are ties in step 4, you can cluster the dupes at that point. In your case, since the individual files will not contain dupes, but entries might be duplicated across files, you want to favour the buffer that was flushed the longest ago (as in the sketch below). That way you will step through the files in sync whenever you're in sections containing identical data.
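Here's a rough sketch in Perl of what I mean. It assumes the sort key is the Apache timestamp field ("[07/Aug/2004:01:39:00 +0000]"), so adjust parse_key() for your actual format; the tie-break uses a counter recording when each buffer was last flushed.

  #!/usr/bin/perl
  use strict;
  use warnings;

  my @files = @ARGV;   # e.g. merge.pl access1.log access2.log > merged.log

  # One buffer slot per input file: filehandle, current line, its key,
  # and a counter recording when the slot was last flushed (for tie-breaks).
  my @slots;
  my $tick = 0;
  for my $file (@files) {
      open my $fh, '<', $file or die "Can't open $file: $!";
      push @slots, { fh => $fh, line => undef, key => undef, flushed => 0 };
  }

  sub parse_key {
      my ($line) = @_;
      # Assumption: timestamps look like [07/Aug/2004:01:39:00 +0000].
      my ($d, $mon, $y, $h, $m, $s) =
          $line =~ /\[(\d{2})\/(\w{3})\/(\d{4}):(\d{2}):(\d{2}):(\d{2})/
          or die "Unparsable line: $line";
      my %mon_num = (Jan=>1,Feb=>2,Mar=>3,Apr=>4,May=>5,Jun=>6,
                     Jul=>7,Aug=>8,Sep=>9,Oct=>10,Nov=>11,Dec=>12);
      return sprintf '%04d%02d%02d%02d%02d%02d',
                     $y, $mon_num{$mon}, $d, $h, $m, $s;
  }

  sub refill {
      my ($slot) = @_;
      my $line = readline $slot->{fh};
      if (defined $line) {
          $slot->{line} = $line;
          $slot->{key}  = parse_key($line);
      } else {
          $slot->{line} = undef;   # this input is exhausted
      }
  }

  refill($_) for @slots;           # step 2: prime every buffer

  while (1) {
      # step 3: find the smallest key among non-empty buffers,
      # breaking ties in favour of the slot flushed longest ago
      my $best;
      for my $slot (grep { defined $_->{line} } @slots) {
          if (!defined $best
              or $slot->{key} lt $best->{key}
              or ($slot->{key} eq $best->{key}
                  and $slot->{flushed} < $best->{flushed})) {
              $best = $slot;
          }
      }
      last unless defined $best;   # step 6: stop when all buffers are empty

      print $best->{line};         # step 4: flush the winner
      $best->{flushed} = ++$tick;
      refill($best);               # step 5: refill from its file
  }

Note that when only one file has entries for a given second, they all come out of the same buffer in their original order, which is exactly the stability you want.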

Makeshifts last the longest.


Re^2: Big hairy ugly log sorting merging problem
by mr. jaggers (Sexton) on Aug 07, 2004 at 01:39 UTC

    *sigh*

    Yep, I realized not too long after posting that straight sorting would be folly... Log entries that occur during the same second shouldn't necessarily be sorted by anything; they should simply preserve whatever order they had in their original files.

    Upon realizing this, I started basically the above, hoping that my inherent laziness would be fed by someone else's apache log merging script ("Here ya go, little guy!").

    If I finish before an entire script magically appears in the comments below, then anyone who wants it can just reply with their intent.

    That way, the saints (like merlyn) can point and laugh, too! :)

    As a side note, wouldn't it have been cool to have mod_perl spin those logs off to the database server, into their own tables? Maybe my boss will pay me to write that...

      Actually, there are modules (in the Apache as well as Perl sense of the word) to dump logs to a database instead of a file.

      Depending on the amount of traffic you get, using them may or may not be wise.
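      I don't have one of those modules in front of me, but the core of the idea is plain DBI. Here's a rough standalone sketch: read combined-format lines on STDIN and insert them row by row; the DSN, credentials, and the access_log table and columns are made up for illustration.

        #!/usr/bin/perl
        use strict;
        use warnings;
        use DBI;

        my $dbh = DBI->connect('dbi:mysql:database=weblogs', 'user', 'password',
                               { RaiseError => 1, AutoCommit => 1 });

        my $sth = $dbh->prepare(q{
            INSERT INTO access_log (host, stamp, request, status, bytes)
            VALUES (?, ?, ?, ?, ?)
        });

        while (my $line = <STDIN>) {
            # combined format: host ident user [date] "request" status bytes ...
            my ($host, $stamp, $request, $status, $bytes) =
                $line =~ /^(\S+) \S+ \S+ \[([^\]]+)\] "([^"]*)" (\d{3}) (\S+)/
                or next;
            $bytes = 0 if $bytes eq '-';
            $sth->execute($host, $stamp, $request, $status, $bytes);
        }

        $dbh->disconnect;

      A mod_perl log handler would do essentially the same thing per request instead of per line.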

      Makeshifts last the longest.

        Well, hundreds of megs per month for several virtual hosts... I think this sounds more like a gut-feeling hardware-requirements question, and I know *I* wouldn't trust my own off-the-cuff guess.

        I suppose I could profile individual insertions and make a judgement based on hit rate...
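        Something like this would at least put a number on a single insertion. It leans on the core Benchmark module; the DSN and access_log table are made up (same as the sketch above), the row values are dummies, and 10_000 is an arbitrary sample size.

          #!/usr/bin/perl
          use strict;
          use warnings;
          use DBI;
          use Benchmark qw(timethis);

          my $dbh = DBI->connect('dbi:mysql:database=weblogs', 'user', 'password',
                                 { RaiseError => 1, AutoCommit => 1 });
          my $sth = $dbh->prepare(q{
              INSERT INTO access_log (host, stamp, request, status, bytes)
              VALUES (?, ?, ?, ?, ?)
          });

          # time 10_000 dummy insertions and report wallclock/CPU
          timethis(10_000, sub {
              $sth->execute('127.0.0.1', '07/Aug/2004:01:39:00 +0000',
                            'GET / HTTP/1.1', 200, 1234);
          });

          $dbh->disconnect;

        Dividing the result by the peak hit rate would say whether the database can keep up.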

        Has anyone seen any performance numbers or hardware suggestions?