Hello all,

I have a log parsing problem, and I seek suggestions as to a reasonable Perlish solution -- I'm not really looking for any code per se, just algorithmic "advice".

First, I'll address the things that are given and cannot be changed.

Each day, I collect a dump from a logfile generator, which is the accumulation of all log entries since the beginning of that month. Each day, a new file is collected, and is theoretically at least as big as the previous day's file. I do not have the ability to directly control this "logfile source", so I must deal with the cumulative nature of the resulting files.

Occasionally, through magic processes that I also have no control over, there may be a purge of the "logfile source", which causes the next day's cumulative file to restart from 0 bytes and contain only what was collected after the purge.

My program must "reconstruct" all of the unique log entries for the given month for a given server.

Assumptions:

  1. A log entry is discrete, and can be uniquely identified by its timestamp (and potentially other key data if necessary)
  2. Time and space are of minimal consequence, although I'd like this to run in a reasonable amount of time
  3. There are 10 servers, each of which will generate one log file per day of the month
  4. There are approximately 5000 log entries available at the end of each month for a server
  5. Each log entry is approximately 250 chars in length

In summary, there will be around 310 files (10 servers, up to 31 days), each growing to roughly 1.25 MB at most (5000 entries of about 250 chars) -- nothing major. Each server will have its logs unique-ified into its own file.

Certainly, in Unix, I could do something like:

1) Concatenate files into a single file
2) Then do: `sort -u <concatfile> > <sortedfile>`

... but I suspect this will eventually live on a Windows box.
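
For illustration only, here is a rough sketch of how the concatenate-then-`sort -u` idea might look in portable Perl, so it would not depend on Unix tools. The helper name `unique_sorted_lines` and the file-name pattern in the usage comment are hypothetical, and the sketch assumes one log entry per line. Perl's default `sort` compares strings by character code, so the ordering may differ from a locale-aware `sort -u`.

```perl
use strict;
use warnings;

# Hypothetical helper: read every line from the given files and return
# the distinct lines in sorted order, roughly like `sort -u` applied to
# the concatenation of those files.
sub unique_sorted_lines {
    my @files = @_;
    my %seen;    # line text => 1
    for my $file (@files) {
        open my $fh, '<', $file or die "Cannot open $file: $!";
        while (my $line = <$fh>) {
            $line =~ s/\r?\n\z//;    # strip LF or CRLF line endings
            $seen{$line} = 1;
        }
        close $fh;
    }
    my @unique = sort keys %seen;
    return @unique;
}

# Possible usage (file names are an assumption):
#   print "$_\n" for unique_sorted_lines(glob 'server01-*.log');
```

Because the hash collapses duplicates no matter which file they came from, a purge in mid-month should not matter: entries that survive only in the earlier files are still picked up.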

I thought that maybe I could compute an MD5 digest for each log entry, and then use that as a hash key to detect duplicates (i.e. ignore every subsequent occurrence of an entry already seen).
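
A minimal sketch of the digest idea, assuming each entry is a plain byte string and that `Digest::MD5` (a core module) is available. The helper name `dedupe_by_digest` is hypothetical. It keeps the first occurrence of each entry in the order the entries were read.

```perl
use strict;
use warnings;
use Digest::MD5 qw(md5_hex);

# Hypothetical helper: given log entries in read order, return only the
# first occurrence of each distinct entry, keyed by its MD5 hex digest.
sub dedupe_by_digest {
    my @entries = @_;
    my %seen;      # MD5 hex digest => number of times seen so far
    my @unique;
    for my $entry (@entries) {
        my $digest = md5_hex($entry);
        next if $seen{$digest}++;    # already seen: skip the duplicate
        push @unique, $entry;
    }
    return @unique;
}
```

With entries of about 250 chars, using the entry text itself as the hash key would probably work just as well; the digest mostly shortens the keys held in memory.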

Thoughts?

Where do you want *them* to go today?

In reply to Logfile parsing across redundant files by thezip
