Since you want your stats by server, the first thing I would do is split the input file into separate files, one per server. That's little more than a one-liner, or you could use the grep utility.
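To make the split concrete, here is a minimal sketch. You didn't show your record layout, so I'm assuming whitespace-separated lines with the server name as the first field; the function name, the `.log` suffix and the output directory are made up for illustration. It is written in Python only to keep it short, and the same few lines translate directly to Perl.

```python
import os

def split_by_server(input_path, out_dir):
    """Copy each record to out_dir/<server>.log, preserving input order.

    Assumption: the server name is the first whitespace-separated field.
    """
    os.makedirs(out_dir, exist_ok=True)
    handles = {}  # server name -> open output file handle
    try:
        with open(input_path) as src:
            for line in src:
                if not line.strip():
                    continue  # skip blank lines
                server = line.split(None, 1)[0]
                fh = handles.get(server)
                if fh is None:
                    # First record seen for this server: open its file once.
                    path = os.path.join(out_dir, server + ".log")
                    fh = handles[server] = open(path, "w")
                fh.write(line)
    finally:
        for fh in handles.values():
            fh.close()
    return sorted(handles)
```

Only one line is held in memory at a time, and one file handle per server stays open, which is fine for ten or so servers.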

Now each pass of the stats script deals with only a small portion of the data, and, provided the input file is itself in epoch order, each server file will be in epoch order too, so you avoid the need to sort. The min and max epoch for a server are just the first and last records of that server's file.
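As a rough sketch of that last point, again assuming a layout you haven't confirmed (epoch as an integer in the second whitespace-separated field) and a made-up function name:

```python
def epoch_range(server_file):
    """Return (min_epoch, max_epoch) for a per-server file that is already
    in epoch order, or None if the file holds no records.

    Assumption: the epoch is an integer in the second field.
    """
    first = last = None
    with open(server_file) as fh:
        for line in fh:
            if not line.strip():
                continue  # skip blank lines
            if first is None:
                first = line  # first record carries the min epoch
            last = line       # keeps being overwritten until the final record
    if first is None:
        return None
    return int(first.split()[1]), int(last.split()[1])
```

No sort and no hash: only the first and the most recent line are ever kept.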

Splitting 6 million records into, say, 10 files by server takes around 3 or 4 minutes. Sorting a 600,000-key hash would take much, much (*much*) longer than that.

You don't say what it is you are summarising, so I can't speak to that, but it is quite likely that you can generate your stats from each of the individual server files in a single pass, without the need to hold every record in RAM.
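Purely as an illustration of the single-pass idea, and guessing at the kind of stats you might want: the sketch below keeps a running count, sum, min and max of a numeric value that I'm assuming sits in the third field. The field position, the choice of stats and the function name are all hypothetical.

```python
def summarise(server_file):
    """One pass over a per-server file, keeping only running totals.

    Assumption: the value to summarise is a number in the third field.
    """
    count = 0
    total = 0.0
    lo = hi = None
    with open(server_file) as fh:
        for line in fh:
            fields = line.split()
            if len(fields) < 3:
                continue  # skip blank or short lines
            value = float(fields[2])
            count += 1
            total += value
            lo = value if lo is None else min(lo, value)
            hi = value if hi is None else max(hi, value)
    mean = total / count if count else None
    return {"count": count, "min": lo, "max": hi, "mean": mean}
```

Memory use stays constant however large the file is, because nothing but the four running values is retained between records.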

That makes two passes over the data, but avoiding both the huge, complex in-memory data structure and the sort more than compensates for that. Processing 6 million records twice shouldn't take more than 10 or 15 minutes on even a fairly modestly spec'ed machine.

Your use of heaps here looks very suspect to me.


Examine what is said, not who speaks.
"Efficiency is intelligent laziness." -David Dunham
"Think for yourself!" - Abigail
"Memory, processor, disk in that order on the hardware side. Algorithm, algorithm, algorithm on the code side." - tachyon

In reply to Re: Design Approach (Processing Huge Hash) by BrowserUk
in thread Design Approach (Processing Huge Hash) by mkirank
