What defines the choice of a Hash or an Array?

ghenry has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.

Re: What defines the choice of a Hash or an Array?
by mattr (Curate) on Mar 31, 2005 at 08:42 UTC

An array is like a stack of blocks. You can add on or pull off a block from the top or bottom of the stack, and you can pull out say the 15th block in the stack. An empty array is like a spot on a table where you can pile blocks. Imagine that each block is transparent with something (a number, a letter, a word, a hyperlink to some other structure..) inside. But the blocks are opaque on the sides, you have to pull one out and open the cover to see what's in a block.

So you can easily pick out a block if you know its height from the table (the bottom block is 0 blocks away, so it is index # 0). Arrays are also called lists, so they are good for storing things ("values") if you have a whole list of them. So an array is a sequential list of values that you can access according to their position in the sequence.

Hashes are different from arrays. If you think about an array as a stack of blocks with opaque sides, you realize that even if you have a 100 block tall stack you can't see what's inside at a glance from the side. You have to count to the 15th block and pull it out. (Well perl will do the counting but you have to specify a block number, or else pop off the top 85 blocks.) So an array, like a stack of opaque blocks, is really a sequential thing that you access one block at a time usually.

A hash isn't like that. It's more like having everything transparent. We are talking random access, not sequential. In fact, there is no stack. Let’s make a hash called %trees. Imagine you are sitting on the floor with your blocks scattered all over the place in front of you. Each block has a unique name or number (usually we like to use English words) printed on it in big easy to read letters, each block associated with its own label. So there is a block called "maple" and inside that block maybe it has the number "50" to describe the height of the tree. Another block is called "sequoia" and inside that block is the number "200" for the height of a sequoia tree.

Say you have a hundred blocks scattered in front of you, all associating the name of a tree with its height. How long does it take to find how tall a dogwood tree is? About no time at all. You can see the keyword "Dogwood" (we call it the "key" of the block) standing out right in front of you, and immediately pick up the Dogwood block and retrieve the value "25" from inside it. Ahah, a Dogwood tree is 25 feet high. So it is really easy to keep a huge amount of data in an easily retrievable way, just assign an easily remembered keyword to it and toss it on the floor (into the hash). You can do this for ten or ten thousand items, or more, it doesn’t matter, you can always pick up the right block if you know its key right away.

If you were using an array (a "stack of blocks") it would be much harder to do this. First of all, a block in an array doesn’t have a keyword, to specify one in an array you have to specify how high it is off the ground (its index, or number in the sequence as you count up from the ground starting at zero). Say you want to store information about a Pine tree being 75 feet high in an array. One easy perlish way of doing it would be to jam the name of the tree and its height together into a single string of letters, separated by a symbol (say we use a colon character as the delimiter). So the string for this tree is "pine:75" which is the value we will stuff into a block we push onto the stack of blocks, our array. We say push(@trees, join(":","pine","75")) to do so. But if we want to look up the height of a given tree in a tall array, we have to go looking inside each block one at a time and see if the value inside each block matches the name of the tree we want (using a regular expression to match the head of the string maybe). All of this trouble is eliminated by using a hash.

An array of course is better than a hash if you have two apple trees of different heights. Because each block in a hash has to have a unique keyword attached to it. If you change the name slightly (say "apple1" and "apple 2") a hash is still useable, but for a list of heights of say twenty different apple trees, an array is better. In this case we know they are apple trees and we can just put their height into the block for that number apple tree.

I hope this helps you a little bit in visualizing the difference between an array (a stack of opaque blocks) and a hash (blocks scattered in front of you on the floor). I use hashes often to keep track of a bunch of things at once, and you might like to try the exists and defined functions for hash elements too. You can easily see if a pear tree exists in the hash %trees by saying: if (exists $trees{pear}) { print "Yes\n";} If you read more of the perl manuals on data structures (see the PM library) you will discover that you can stuff arrays into hashes and vice versa, making all kinds of complex structures. So the best answer to our tree survey would be to make a hash of lists (a HoL, where you stuff an array into each hash element), so that for the hash key "apple" we can have an ordered list of heights for all apple trees, like $trees{apple}[0] = 20, $trees{apple}[1] = 22, etc.

Finally you might like to consider the reason why these structures are called hashes and arrays in the first place. An array really can be more than just a one-dimensional stack of blocks. It can have many dimensions, so a three dimensional array would be like a solid cube full of blocks. You would still have to select a block by its index along each dimension (think its x,y,z coordinates) like $block[2][5][15]. So an array has elements neatly arrayed by number.

A hash is a term that talks about a kind of mathematical function that distributes data well, and you could imagine that in order to instantly access any block out of say a hundred thousand scattered all over the floor, you are going to need a really good way of distributing that information. In fact a hash in perl stores data in something like buckets hanging from tree limbs and the speed with which you can get data out of a bucket has to do with making sure that you don't have too many branches, or too much data in a bucket. In fact the keyword you pick is mathematically converted (or "encrypted" or "hashed") so well that you can pick a million different keywords and be very sure they will not collide with each other after being hashed.