in reply to Design Approach (Processing Huge Hash)

Since you want your stats by server, the first thing I would do is split the input file into separate files by server. That's little more than a one-liner in Perl, or you can use the grep utility.
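Something along these lines would do it (untested sketch; it assumes the server name is the first whitespace-separated field of each record and that "huge.log" is your input file, so adjust for your actual layout):

    perl -ane 'open $fh{$F[0]}, ">>", "$F[0].log" or die $! unless $fh{$F[0]};
               print { $fh{$F[0]} } $_;' huge.log

    # or, with the grep utility, one known server name at a time:
    # grep '^serverA ' huge.log > serverA.log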

Now you're dealing with a small portion of the data in each pass of the stats script, and each file will already be sorted by epoch, so you avoid the need to sort. Max and min epoch for a server are just the last and first records of that server's file.
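In fact, if all you wanted were the epoch range, you wouldn't even need a full pass over the split file. A rough sketch, assuming the per-server file is in epoch order, the epoch is the second whitespace-separated field, and no record is longer than 4096 bytes:

    #!/usr/bin/perl
    use strict;
    use warnings;

    # Sketch only: pull min/max epoch from one server's file without
    # reading the whole thing.
    my $file = shift or die "usage: $0 serverfile\n";
    open my $fh, '<', $file or die "open '$file': $!";

    my $first = <$fh>;                       # first record => min epoch

    seek $fh, -4096, 2 or seek $fh, 0, 0;    # jump near the end of the file
    my $last;
    $last = $_ while <$fh>;                  # last complete record => max epoch

    my $min_epoch = ( split ' ', $first )[1];
    my $max_epoch = ( split ' ', $last  )[1];

    print "$file: epochs $min_epoch .. $max_epoch\n";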

Splitting 6 million records into, say, 10 files by server takes around 3 or 4 minutes. Sorting a 600,000-key hash would take much, much, (*much*) longer than that.

You don't say what it is you are summarising, so I can't speak to that; but it is quite likely that you can generate your stats from each of the individual server files in a single pass, without the need to store every record in RAM.
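For instance, if what you're after is something like count/sum/min/max of one numeric field per server, a single streaming pass over each server's file needs only a handful of scalars. A sketch, assuming whitespace-separated records with the epoch in field 2 and the value to summarise in field 3 (again, adjust for your real layout):

    #!/usr/bin/perl
    use strict;
    use warnings;

    # Sketch only: one-pass per-server stats. Assumes whitespace-separated
    # records with the epoch in field 2 and a numeric value in field 3.
    my $file = shift or die "usage: $0 serverfile\n";
    open my $fh, '<', $file or die "open '$file': $!";

    my ( $count, $sum, $min, $max, $first_epoch, $last_epoch ) = ( 0, 0 );

    while (<$fh>) {
        my ( undef, $epoch, $value ) = split;

        $first_epoch //= $epoch;    # file is already in epoch order, so the
        $last_epoch    = $epoch;    # first/last records give min/max epoch

        ++$count;
        $sum += $value;
        $min = $value if !defined $min or $value < $min;
        $max = $value if !defined $max or $value > $max;
    }

    printf "%s: n=%d sum=%s min=%s max=%s epochs %s..%s\n",
        $file, $count, $sum, $min, $max, $first_epoch, $last_epoch;

Run that once per split file and you have your per-server stats without ever holding more than one record in memory.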

That's two passes over the data, but avoiding both the huge, complex in-memory data structure and the sort more than compensates for it. Processing 6e7 records twice shouldn't take more than 10 or 15 minutes on even a fairly modestly spec'ed machine.

Your use of heaps here looks very suspect to me.


Examine what is said, not who speaks.
"Efficiency is intelligent laziness." -David Dunham
"Think for yourself!" - Abigail
"Memory, processor, disk in that order on the hardware side. Algorithm, algorithm, algorithm on the code side." - tachyon