Monks,

Many months ago, a workhorse web server here at work got split open like a sardine can by a clever Baltic hacker. At any rate, backups were pulled a few days before the box was dropped.

Apache log files from the now-old backups got dropped onto the new web server, without merging in the few days of log accumulation. Most of the clients who were affected wanted that log activity merged back in for stats, etc... (insert sound of other shoe dropping)... and now it's *my* problem.

I attempted to use sdiff to extract the missing logs, but the two versions of the logs each contain unique entries. I need a combination of all unique lines of both files.

So I tried the mergelog tool, only to find out that it allows duplication. Then I ran "uniq" on the result, only to find out that mergelog merges chunks of log lines with the same timestamp: two lines with stamp "[01/Jan/2004:00:07:55..." go in back-to-back, followed by the same pair of log lines from the other log file. Since uniq only removes adjacent duplicate lines, it can't help unless the merged log file is somehow sorted by more than just the timestamp.
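To make the uniq problem concrete, here is a minimal sketch in Python (not Perl, purely for illustration). It assumes the timestamp is the first bracketed field on each line, as in the Apache common log format; the names `stamp_of`, `adjacent_uniq`, and `dedupe_within_stamp` are made up for this example. It shows that adjacent-only deduplication leaves an A, B, A, B pattern untouched, while remembering the lines seen within one run of equal timestamps removes the repeats.

```python
import re
from itertools import groupby

# Assumption: the timestamp is the first [...] field on the line.
STAMP = re.compile(r"\[([^\]]+)\]")

def stamp_of(line):
    """Return the raw timestamp text, or "" if the line has none."""
    m = STAMP.search(line)
    return m.group(1) if m else ""

def adjacent_uniq(lines):
    """Mimic uniq: drop a line only if it equals the line just before it."""
    return [line for line, _ in groupby(lines)]

def dedupe_within_stamp(lines):
    """Yield each distinct line once per run of equal timestamps.

    Only the lines sharing the current timestamp are held in memory.
    """
    for _, group in groupby(lines, key=stamp_of):
        seen = set()
        for line in group:
            if line not in seen:
                seen.add(line)
                yield line
```

One caveat: two genuinely separate but byte-identical requests logged in the same second would be collapsed into one line by this approach.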

Searching PM, I found this discussion between monks (which, I might add, has a valuable and satisfying suggestion to kick the fellow that passed this off on me), which offers two short, untested, and not-entirely-applicable sort routines. Ugh.

One idea, implementing a radix sort because of the large size of the log files, is interesting, and I'm sort of between that and using some version of a Schwartzian Transform right now; but I think that sorting is probably going to be the wrong way to go.
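For reference, the decorate-sort-undecorate idea behind a Schwartzian Transform looks roughly like the following. This is a sketch in Python rather than Perl, and it assumes timestamps in the usual Apache form (e.g. `01/Jan/2004:00:07:55 -0800`) and English month abbreviations; `parse_stamp` and `sort_unique` are hypothetical names. Each unique line is paired once with its parsed timestamp, the pairs are sorted, and the keys are thrown away.

```python
import re
from datetime import datetime

# Assumption: the timestamp is the first [...] field on the line.
STAMP = re.compile(r"\[([^\]]+)\]")

def parse_stamp(line):
    """Parse the bracketed Apache timestamp into an aware datetime."""
    m = STAMP.search(line)
    if m is None:
        raise ValueError("no timestamp in line: %r" % line)
    return datetime.strptime(m.group(1), "%d/%b/%Y:%H:%M:%S %z")

def sort_unique(lines):
    """Union of unique lines, ordered by timestamp (ties broken by line text)."""
    # Decorate: compute each sort key exactly once.
    decorated = [(parse_stamp(line), line) for line in set(lines)]
    # Sort on (timestamp, line) pairs.
    decorated.sort()
    # Undecorate: keep only the original lines.
    return [line for _, line in decorated]
```

This holds every line in memory, so for the larger files it may well be the wrong way to go, as suspected above.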

I've already hand-merged the smaller virtual hosts, but some of these guys have 100-200MB monthly log files, which is too much to merge by hand (at this point). It must be algorithmic. So, basically this problem just sucks.

Anyone got an idea how to do this for log files of many hundred megs? I've already had two people tell me that Perl, being interpreted, is too slow for a solution. Well, I've had excellent experiences munging multi-gig data files with Perl very rapidly, but I'm not willing to rule anything out. Any creative ideas between diff and patch? This should be a simple problem, but there doesn't seem to be enough coffee in my office for me to fuel a solution...
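One possible shape for a solution that avoids sorting altogether, sketched in Python under some loud assumptions: each input log is already in timestamp order (as Apache normally writes it), and the timestamp is the first bracketed field. The names `stamp_key` and `merge_unique` are hypothetical. The two files are streamed through a two-way merge, and duplicates are dropped within each run of equal timestamps, so memory use depends on the busiest second rather than on file size.

```python
import heapq
import re
from datetime import datetime
from itertools import groupby

# Assumption: the timestamp is the first [...] field on the line.
STAMP = re.compile(r"\[([^\]]+)\]")

def stamp_key(line):
    """Parse the bracketed Apache timestamp into an aware datetime."""
    m = STAMP.search(line)
    if m is None:
        raise ValueError("no timestamp in line: %r" % line)
    return datetime.strptime(m.group(1), "%d/%b/%Y:%H:%M:%S %z")

def merge_unique(old_lines, new_lines):
    """Lazily merge two timestamp-ordered line streams, dropping duplicates.

    Assumes both inputs are already sorted by timestamp.
    """
    merged = heapq.merge(old_lines, new_lines, key=stamp_key)
    for _, group in groupby(merged, key=stamp_key):
        seen = set()  # lines already emitted for this timestamp
        for line in group:
            if line not in seen:
                seen.add(line)
                yield line
```

Because open file handles iterate line by line, something like `out.writelines(merge_unique(open(backup_path), open(live_path)))` would stream the result to disk, where `backup_path` and `live_path` are placeholder names. As with any exact-match approach, two separate but byte-identical requests in the same second collapse into one line.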


In reply to Big hairy ugly log sorting merging problem by mr. jaggers
