in reply to datastructure of array and hash?
cool:
Hashes appear to store their entries in random order because they use a hash function to turn each key into a slot number. That makes reads and writes very quick: you calculate the slot number from the key, then fetch the data from that slot. An array is faster still, since you already have the slot number in hand.
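To make the "key to slot number" idea concrete, here's a toy sketch. The `toy_slot` function and the 8-slot table are made up for illustration (Perl's real internal hash function is far more sophisticated, and real hashes also handle collisions, which this sketch ignores):

```perl
# Toy illustration (NOT Perl's actual internal hashing): turn a string
# key into one of 8 slot numbers, then store/fetch directly in that slot.
my $num_slots = 8;

sub toy_slot {
    my ($key) = @_;
    my $sum = 0;
    $sum += ord($_) for split //, $key;   # add up the character codes
    return $sum % $num_slots;             # fold the sum into a slot number
}

my @slots;                                # the "buckets"
$slots[ toy_slot('apple') ] = [ 'apple', 42 ];  # store: one calculation, one write

my $found = $slots[ toy_slot('apple') ]; # fetch: one calculation, one read
print "$found->[1]\n";                   # prints 42
```

The point is that both the store and the fetch cost one calculation plus one array access, no matter how many entries the table holds.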
So, is an array faster? Yes, for certain things. If you have a dense collection of integer keys, say the even numbers from 100 to 200, then a simple array gets you your values directly: $val = $arr[($key-100)/2];. If your index values are sparse, though, an array wastes a good bit of memory holding nothing, so a hash becomes more and more attractive as your key range grows large while the data stays sparse. A hash, on the other hand, need only be as big as the data it actually holds.
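A runnable sketch of that dense-key case (the stored values here are made up just for the demo): even keys 100..200 map straight onto array indices 0..50, so a lookup is pure arithmetic with no hashing at all.

```perl
# Dense integer keys: even numbers 100..200 map to indices 0..50.
my @arr;
for my $key (grep { $_ % 2 == 0 } 100 .. 200) {
    $arr[ ($key - 100) / 2 ] = $key * 10;   # made-up values for the demo
}

my $key = 150;
my $val = $arr[ ($key - 100) / 2 ];         # direct index: (150-100)/2 = 25
print "$val\n";                             # prints 1500
```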
Another case where arrays win is when you always process a set of items in order and don't need random access. Then an array is fine: you just add items to the array and process them in order. A hash won't help you here. This is the situation your benchmark demonstrates: you get all the overhead of the hash (calculating the key) with none of the benefits.
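The in-order case looks like this (the work items here are arbitrary, just squares of 1..5): push items on, walk them front to back, and never pay for key hashing.

```perl
# Sequential processing: just push work items and walk the array in order.
# A hash would add key-hashing overhead with no benefit here.
my @queue;
push @queue, $_ * $_ for 1 .. 5;    # enqueue some work items

my $total = 0;
$total += $_ for @queue;            # process strictly in order
print "$total\n";                   # prints 55
```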
Frequently, however, you have a text string that you want to use as a key. You could approach that with an array by storing two values in each slot (key and value), or with two parallel arrays, one for the keys and one for the values. In the first case, you have to search half the array (on average) to find the value for a given key. In the second case, you have to search half the key array (on average) to find the key, then you can look up the value directly in the second array at the same index. Either way, the array forces you through many comparisons to reach a particular value. With a hash, you need only calculate the expected slot position from the key and check whether it's there: one calculation and one lookup, resulting in *much* faster access. Here's an example where we search for random values:
#!/usr/bin/perl
use strict;

# Build the dataset
my $iterations = 100000;
my $range      = 10000;
my $num_keys   = 1000;
my %hash  = map { "a_" . int(rand($range)) => 0 } (0..$num_keys);
my @array = sort keys %hash;
my $cnt_array = 0;
my $cnt_hash  = 0;

# look for random array entries (linear scan)
my $start = time;
for (0..$iterations) {
    my $key = "a_" . int(rand($range));
    for (0..$#array) {
        ++$cnt_array if $key eq $array[$_];
    }
}
my $ttl = time - $start;
print "Array search took $ttl seconds, found $cnt_array\n";

# look for random hash entries
$start = time;
for (0..$iterations) {
    my $key = "a_" . int(rand($range));
    ++$cnt_hash if defined $hash{$key};
}
$ttl = time - $start;
print "Hash search took $ttl seconds, found $cnt_hash\n";
When I ran this, I got these results:
$ ./foo.pl
Array search took 33 seconds, found 9534
Hash search took 0 seconds, found 9426
So hashes definitely have their use...
Note: I glossed over some of the details of a hash, so if you want to really get into the details you'll need to google a bit.
...roboticus
Update: I just ran it with 10 million iterations, and got:
$ ./foo.pl
Array search took 3258 seconds, found 955277
Hash search took 16 seconds, found 954543