in reply to Hash of Arrays versus Two Hashes

Here is a script that measures the memory footprint of the two approaches for fairly large hashes (credit to the Anonymous Monk who posted it in Re: Determining the Memory Usage of a Perl program from within Perl):
use strict;
use vars qw(%h1 %h2);

sub get_statm_info {
    my $error = '';
    my $ref   = {};
    if ( !open( _INFO, "</proc/$_[0]/statm" ) ) {
        $error = "Couldn't open /proc/$_[0]/statm [$!]";
        return $ref;
    }
    my @info = split( /\s+/, <_INFO> );
    close(_INFO);

    ### these are all the props (we skip some):
    ### size resident shared trs lrs drs dt

    ### get the important ones
    $ref = {
        size     => $info[0] * 4,
        resident => $info[1] * 4,
        shared   => $info[2] * 4,
    };
    return $ref;
}

# Double hash case.
%h1 = map { $_, 16 } 1..10000;
%h2 = map { $_, 20 } 1..10000;

# Single hash of arrays.
#%h1 = map { $_, [16,20] } 1..10000;

my $ref = get_statm_info($$);
print $ref->{size},     "\n";
print $ref->{resident}, "\n";
print $ref->{shared},   "\n";

I measured 4352k for the two-hash case and 4944k for the single hash of arrays. It makes sense to me that the hash of arrays costs more: each value needs an extra anonymous array plus a reference to it.
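If you have the Devel::Size module from CPAN installed, you can also cross-check the cost per structure instead of measuring the whole process. This is just a sketch, and the byte counts will come out lower than the /proc figures because only the structures themselves are counted:

use strict;
use warnings;
use Devel::Size qw(total_size);

my %h1  = map { $_ => 16 } 1..10000;
my %h2  = map { $_ => 20 } 1..10000;
my %hoa = map { $_ => [ 16, 20 ] } 1..10000;

# total_size() follows references, so the anonymous arrays are counted too.
printf "two hashes:     %d bytes\n", total_size(\%h1) + total_size(\%h2);
printf "hash of arrays: %d bytes\n", total_size(\%hoa);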

What data structure makes sense for a search engine? I'm not sure either of these does if you are scoring hundreds of thousands of pages: you don't want to hold all those scores and page references in memory and then sort the whole lot.

What might make sense is a heap, which lets you keep the N best scores partially ordered with logarithmic insertion time. There's a Heap module on CPAN as a starting point.
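For illustration, here is a minimal pure-Perl sketch of that idea (the Heap module wraps the same machinery behind an object interface): a fixed-size min-heap whose root is always the worst of the current top N, so a new page only costs O(log N) work when it beats that root, and only N entries ever sit in memory. The page ids and random scores are just placeholders for whatever the real search engine produces.

use strict;
use warnings;

my $N = 10;
my @heap;    # each element is [ $score, $page_id ]

sub sift_up {
    my $i = $#heap;
    while ( $i > 0 ) {
        my $parent = int( ( $i - 1 ) / 2 );
        last if $heap[$parent][0] <= $heap[$i][0];
        @heap[ $parent, $i ] = @heap[ $i, $parent ];
        $i = $parent;
    }
}

sub sift_down {
    my $i = 0;
    while (1) {
        my ( $l, $r ) = ( 2 * $i + 1, 2 * $i + 2 );
        my $smallest = $i;
        $smallest = $l if $l <= $#heap and $heap[$l][0] < $heap[$smallest][0];
        $smallest = $r if $r <= $#heap and $heap[$r][0] < $heap[$smallest][0];
        last if $smallest == $i;
        @heap[ $smallest, $i ] = @heap[ $i, $smallest ];
        $i = $smallest;
    }
}

sub consider {
    my ( $score, $page ) = @_;
    if ( @heap < $N ) {
        push @heap, [ $score, $page ];
        sift_up();
    }
    elsif ( $score > $heap[0][0] ) {
        $heap[0] = [ $score, $page ];    # replace the current worst of the top N
        sift_down();
    }
}

# Placeholder scoring loop.
consider( rand(), "page$_" ) for 1 .. 100_000;

# The heap holds the top N in no particular order; sorting the
# tiny result set at the end is cheap.
for my $e ( sort { $b->[0] <=> $a->[0] } @heap ) {
    printf "%-10s %.4f\n", $e->[1], $e->[0];
}

The point is that the expensive part, scanning hundreds of thousands of candidates, never needs more than N entries resident, and the final sort only touches those N.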