mkirank has asked for the wisdom of the Perl Monks concerning the following question:

Problem

Multiple servers write to a single log file, and the file accumulates entries for several days. For each day and each server, I need to summarize the data and print the top x elements.

Logfile Format

servername,epoch time,Field1,Field2,Field3,Field4 (since multiple servers are writing, the epoch times are not in order)
Below is the flow of the intended program (there may be some syntax errors).
1. The logfile has around 5-6 million entries.

2. Is there a better approach (such as summarizing the data as we read it from the logfile), since holding everything in memory will be expensive?
3. For the summarization we again need to build a hash and take the top x elements of that hash.

use strict;
use warnings;
use Heap::Fibonacci;
use Heap::Elem::Num;

my %servers;   # $servers{server}{epoch} = [ [f1, f2, f3, f4], ... ]
my %heaps;     # one heap of epoch times per server
my $file = shift;    # path to the logfile

open my $log, '<', $file or die "cannot open $file: $!";
while (my $line = <$log>) {
    # get_details() splits a log line into its fields (not shown here)
    my ($servername, $epoch, $field1, $field2, $field3, $field4)
        = get_details($line);

    ## create a reference to an array holding this record's fields
    my $tmparref = [ $field1, $field2, $field3, $field4 ];

    if (exists $servers{$servername}{$epoch}) {
        push @{ $servers{$servername}{$epoch} }, $tmparref;
    }
    else {
        $servers{$servername}{$epoch} = [ $tmparref ];
    }

    ## create a heap object for this server the first time we see it
    $heaps{$servername} ||= Heap::Fibonacci->new;

    ## for each server, add the epoch time to its heap object
    $heaps{$servername}->add( Heap::Elem::Num->new($epoch) );
}
close $log;

for my $servername (keys %servers) {
    my $process_server = $servers{$servername};
    ## sort on the keys (epochs), get the minimum and maximum epoch
    ## for that server, then for each day:
    ##     build a hash with the summarized data
    ##     summarize the data
    ## print the record for the top x (a constant)
}

•Re: Design Approach (Processing Huge Hash)
by merlyn (Sage) on Aug 26, 2004 at 13:13 UTC
    1. The logfile has around 5-6 million entries.
    That's a sure sign that you're probably a lot better off with a real database.

    May I suggest starting off with DBD::SQLite, and then working your way up to PostgreSQL if that is insufficient?

    Then, it'll simply be a matter of writing a half-dozen lines of SQL, and your results will be quick and painless.
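    A minimal sketch of that approach with DBD::SQLite; the table layout, the file name and the SUM(f1) summary are only assumptions, since the post doesn't say what is being summarized:

    use strict;
    use warnings;
    use DBI;

    my $dbh = DBI->connect('dbi:SQLite:dbname=log.db', '', '',
                           { RaiseError => 1, AutoCommit => 0 });

    $dbh->do(q{
        CREATE TABLE IF NOT EXISTS log (
            server TEXT, epoch INTEGER,
            f1 REAL, f2 REAL, f3 REAL, f4 REAL
        )
    });

    my $ins = $dbh->prepare(
        'INSERT INTO log (server, epoch, f1, f2, f3, f4) VALUES (?,?,?,?,?,?)');

    open my $log, '<', 'server.log' or die "cannot open server.log: $!";
    while (<$log>) {
        chomp;
        $ins->execute( split /,/ );
    }
    close $log;
    $dbh->commit;    # one transaction keeps millions of inserts tolerable

    # e.g. the top 10 server/day combinations by the sum of Field1
    my $top = $dbh->selectall_arrayref(q{
        SELECT server, date(epoch, 'unixepoch') AS day, SUM(f1) AS total
        FROM   log
        GROUP  BY server, day
        ORDER  BY total DESC
        LIMIT  10
    });
    print join(',', @$_), "\n" for @$top;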

    -- Randal L. Schwartz, Perl hacker
    Be sure to read my standard disclaimer if this is a reply.

      We do plan to use a database, but the script has to support several databases (PostgreSQL, Oracle, MS-SQL).
      If we insert every raw record into the database it will take much more time (we cannot rely on bulk inserts, since the mechanism differs between databases), so we came up with the idea of summarizing first, which reduces the number of inserts to the database.
      Following your suggestion, what I can probably do is insert all the records into SQLite (that insertion will be faster than into the other databases), summarize the data with SQLite, and then insert the summarized data into PostgreSQL or whichever database is in use.
      Thanks for your comments.
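      A rough sketch of that hand-off between two DBI handles; the summary query and the target table summary(server, day, total) are illustrative, not from the post:

      use strict;
      use warnings;
      use DBI;

      my $lite = DBI->connect('dbi:SQLite:dbname=log.db', '', '', { RaiseError => 1 });
      # illustrative target connection; swap the DSN for Oracle or MS-SQL as needed
      my ($user, $pass) = ('stats', 'secret');
      my $pg = DBI->connect('dbi:Pg:dbname=stats', $user, $pass, { RaiseError => 1 });

      my $out = $pg->prepare('INSERT INTO summary (server, day, total) VALUES (?,?,?)');

      my $in = $lite->prepare(q{
          SELECT server, date(epoch, 'unixepoch') AS day, SUM(f1) AS total
          FROM   log
          GROUP  BY server, day
      });
      $in->execute;
      while (my @row = $in->fetchrow_array) {
          $out->execute(@row);    # far fewer rows to insert than raw log lines
      }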
Re: Design Approach (Processing Huge Hash)
by BrowserUk (Patriarch) on Aug 26, 2004 at 14:52 UTC

    Since you want your stats by server, the first thing I would do is split the input file into separate files by server. That's little more than a one-liner, or you could use the grep utility.
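    A sketch of that split, assuming the server name is the first comma-separated field and that naming the output files after the servers is acceptable:

    use strict;
    use warnings;

    my %fh;
    open my $log, '<', 'server.log' or die "cannot open server.log: $!";
    while (my $line = <$log>) {
        my ($server) = split /,/, $line, 2;
        unless ($fh{$server}) {
            open $fh{$server}, '>', "$server.log"
                or die "cannot create $server.log: $!";
        }
        print { $fh{$server} } $line;
    }
    close $log;
    close $_ for values %fh;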

    Now you're dealing with a small portion of the data in each pass of the stats script, and each file will already be in epoch order, so you avoid the need to sort. The max and min epoch for a server are just the last and first records of that server's file.

    Splitting 6 million records into say 10 files by server takes around 3 or 4 minutes. Sorting a 600,000 key hash would take much, much, (*much*) longer than this.

    You don't say what it is you are summarising, so I can't speak to that, but it is quite likely that you can generate your stats from each of the individual server files in a single pass, without the need to store every record in RAM.

    Two passes over the data, but avoiding building a huge and complex in-memory data structure, and avoiding the sort, more than compensates for that. Processing 6e6 records twice shouldn't take more than 10 or 15 minutes on even a fairly modestly spec'ed machine.
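    For illustration, a single pass over one server's file, assuming the summary wanted is per-day count, sum, min and max of Field1 (the post doesn't say, so these are stand-ins):

    use strict;
    use warnings;
    use POSIX qw(strftime);

    my %day;
    open my $in, '<', 'server01.log' or die "cannot open server01.log: $!";
    while (<$in>) {
        chomp;
        my ($server, $epoch, $f1) = split /,/;
        my $d = strftime('%Y-%m-%d', gmtime $epoch);
        my $s = $day{$d} ||= { count => 0, sum => 0 };
        $s->{count}++;
        $s->{sum} += $f1;
        $s->{min} = $f1 if !defined $s->{min} || $f1 < $s->{min};
        $s->{max} = $f1 if !defined $s->{max} || $f1 > $s->{max};
    }
    close $in;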

    Your use of heaps here looks very suspect to me.


    Examine what is said, not who speaks.
    "Efficiency is intelligent laziness." -David Dunham
    "Think for yourself!" - Abigail
    "Memory, processor, disk in that order on the hardware side. Algorithm, algorithm, algorithm on the code side." - tachyon
Re: Design Approach (Processing Huge Hash)
by rsteinke (Scribe) on Aug 26, 2004 at 15:00 UTC

    Storing all the lines from the logfile in memory seems the wrong way to go here. Both of your objectives lend themselves to parse-as-you-go.

    For printing the top x lines, simply read the top x lines from each file, then sort them and print the top x of the sort. You'll never need most of the lines in the file.
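    A sketch of that merge, assuming each per-server file is already ordered best-first on the value of interest and that a plain string sort gives the right final order (both are assumptions; $x and the glob pattern are illustrative):

    use strict;
    use warnings;

    my $x = 10;
    my @candidates;
    for my $file (glob '*.log') {
        open my $in, '<', $file or die "cannot open $file: $!";
        my $count = 0;
        while (defined(my $line = <$in>)) {
            push @candidates, $line;        # only the first x lines of each file
            last if ++$count >= $x;
        }
        close $in;
    }
    my @best = grep { defined } (sort @candidates)[0 .. $x - 1];
    print @best;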

    For getting summary data, just keep running totals on the data you care about. For min and max values that's easy, but you can even keep running averages and such fairly easily.

    my ($avg, $num_contrib) = (0, 0);
    foreach (<get a line>) {
        my $val = <something>;
        $avg += ($val - $avg) / ++$num_contrib;
    }
    Other kinds of running totals can be computed with similar algorithms.
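    One such "similar algorithm", sketched here as an addition (not from the original reply): a running variance via Welford's method, kept alongside the running average.

    my @values = (12, 7, 9, 15);    # stand-in for values read line by line
    my ($avg, $m2, $n) = (0, 0, 0);
    for my $val (@values) {
        $n++;
        my $delta = $val - $avg;
        $avg += $delta / $n;
        $m2  += $delta * ($val - $avg);
    }
    my $variance = $n > 1 ? $m2 / ($n - 1) : 0;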

    Naturally, you'd want to combine all these things so you're only doing a single pass through each file, for efficiency.

    Ron Steinke
    <rsteinke@w-link.net>