Having spent the best part of an hour chasing down a bug with a hash loop, I feel the need to share my concluding thoughts. Imagine the following fairly innocent-sounding beginner question:

I have a hash, and I want a loop that processes every element. What's the best way to do it?

An experienced programmer might respond with some further questions, such as "Do the elements need to be processed in key order?" and "Is there anything special about the hash - for example, is it tied?" - at which point the novice's eyes glaze over at the mention of tied hashes.

The most efficient way to do what is asked is as follows; benchmarking is left as an exercise for the reader :)

    while (my ($key, $value) = each %my_hash) {
        .
        .   # some processing on the element
        .
    }
The reason this is efficient is that we are using the 'each' function as an iterator: rather than first building a list of all the hash's members and then processing each one, successive calls to 'each' return the next member until the hash is exhausted.

Where can this go wrong?

It is all very well if the code inside the loop is simple. But consider what happens when code inside the loop calls 'keys' or 'values' on the same hash (or starts another 'each' over it). Calling 'keys' or 'values' resets the internal iterator that 'each' relies on, and a nested 'each' shares that same iterator. Disaster: the loop starts over and spins forever on the same elements!
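
Here is a minimal sketch of the failure mode (the hash contents and the guard counter are invented for illustration; without the guard, the loop would never terminate):

    my %h = (a => 1, b => 2, c => 3);
    my $iterations = 0;
    while (my ($key, $value) = each %h) {
        print "processing $key\n";
        my @all = keys %h;   # resets the hash's internal iterator
        # 'each' now starts over from the beginning, so the loop
        # keeps seeing the same first key; bail out for the demo
        die "stuck on key '$key'!" if ++$iterations > 10;
    }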

This can happen several subroutine calls deep, and can be difficult to track down. In my case, I gave up tracking the bug through all the nested calls, and instead rewrote the loop as follows:

    for my $key (keys %my_hash) {
        my $value = $my_hash{$key};   # and no need to change the code below
        .
        .   # some processing on the element
        .
    }
This has fixed my bug, and the code works as desired. However, it occurs to me that there is a circumstance in which this approach will also fail: when the code is inserting and/or deleting members of the hash.

Here is a completely safe way to have a loop that involves deletions and/or insertions, without skipping members or losing track of where you are. This code iterates over the original set of hash keys:

    for my $key (my @temp = keys %my_hash) {
        my $value = $my_hash{$key};   # and no need to change the code below
        .
        .   # some processing on the element
        .
    }
Since the set of keys is held in the array @temp, it is safe to iterate over these values despite any changes to the actual hash keys.

Conclusion:

If you know that nothing you do inside the loop will interfere with the hash, the 'each' approach is best. Otherwise, if you are not changing the keys, use the middle approach. Finally, save the keys into a temporary array if you know they could be changing.

Replies are listed 'Best First'.
Re: Iterating hashes safely and efficiently
by Abigail-II (Bishop) on Aug 11, 2003 at 15:08 UTC
    Saving the keys in a temporary array could be a very bad thing to do. It's ok if the hash is small, but consider the effects if the hash contains ten million entries and is tied through some DB* mechanism. Your program suddenly needs a few hundred additional megabytes of memory. At best, that makes your program slow, or your program crashes. In the worst case, it makes every other process running on the box slow.
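
    For instance, here is a minimal sketch of the memory-friendly alternative, using 'each' on a hash tied with DB_File (the file name big.db is invented for illustration):

        use strict;
        use warnings;
        use Fcntl;
        use DB_File;

        # 'big.db' is a stand-in file name for this sketch
        tie my %big, 'DB_File', 'big.db', O_RDONLY, 0644, $DB_HASH
            or die "Cannot tie big.db: $!";

        # each fetches one key/value pair from disk at a time;
        # keys %big would pull the entire key set into memory
        while (my ($key, $value) = each %big) {
            # process $key / $value
        }

        untie %big;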

    Abigail

Re: Iterating hashes safely and efficiently
by sauoq (Abbot) on Aug 12, 2003 at 00:04 UTC
    This code iterates over the original set of hash keys:
    for my $key (my @temp = keys %my_hash)

    I'm confused as to why you think that buys you anything over for my $key (keys %my_hash) which, afaik, will always iterate over the original set of keys as well. Keys added in the loop won't be iterated over because the keys() function is only called once and you are iterating over the list it returns. Right? (I'll be mighty embarrassed if not.) As for deletions, the temporary array doesn't help you there either; that's what exists() is for.
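
    A sketch of that pattern, using exists() to cope with deletions (hash name as in the original post):

        for my $key (keys %my_hash) {
            # skip any entry deleted earlier in this same loop
            next unless exists $my_hash{$key};
            my $value = $my_hash{$key};
            # some processing on the element
        }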

    -sauoq
    "My two cents aren't worth a dime.";
    
Re: Iterating hashes safely and efficiently
by jarich (Curate) on Aug 12, 2003 at 08:06 UTC
    Here is a completely safe way to have a loop that involves deletions and/or insertions, without skipping members or losing track of where you are. This code iterates over the original set of hash keys:
    for my $key (my @temp = keys %my_hash) {
        my $value = $my_hash{$key};   # and no need to change the code below
        .
        .   # some processing on the element
        .
    }

    When I see someone doing this, I begin to wonder whether they've really thought about what they're trying to do. Deleting or inserting elements into the hash (or array) you're iterating over in foreach OUGHT to cause you problems! What do you do when you get to the element that you've since deleted? Why wouldn't you want to perform the operation on the element you've just inserted?

    I understand why you might use each and I understand the problems you can have with that, but I feel that if you can't perform your actions by using foreach and keys, then you should be refactoring your loop.

    Perhaps what you really want to be doing is having 2 loops, but you can probably just get by with one. For example:

    my %new;
    foreach my $key (keys %old) {
        if (....) {
            $new{$key} = $old{$key};
            # do further processing.
        }
        else {
            print STDERR "dropped key: $key\n";
        }
    }

    I can't think of a good reason that a good programmer would mess around with the list they're iterating over unless they felt that they just had to avoid extra variables, and that isn't really a good reason.

    All the best,

    jarich

Re: Iterating hashes safely and efficiently
by adrianh (Chancellor) on Aug 12, 2003 at 11:27 UTC

    As sauoq correctly points out, there is no difference between:

    for my $key (keys %my_hash) {

    and

    for my $key (my @temp = keys %my_hash) {

    Both create a temporary list. Both will be unaffected by changes to %my_hash.

    If you want to make sure you cover keys added in the loop processing, and skip keys deleted by the loop processing, then you'll need to do something messy like:

    my %hash = (a => 1, b => 2);
    my %seen_key = ();
    my $seen_key_count;
    do {
        $seen_key_count = keys %seen_key;
        foreach my $key (keys %hash) {
            # skip any entries deleted since we called keys
            next unless exists $hash{$key};
            # skip any keys we have already seen
            next if $seen_key{$key}++;
            my $value = $hash{$key};
            # do stuff with $value
            print "$key = $value\n";
            # and mangle the hash
            $hash{c} = "foo";
            delete $hash{b};
        }
    } until ($seen_key_count == keys %seen_key);

    But I'd agree with jarich and dragonchild. This smells like a design problem. If I needed to code anything like this, I would have a very strong suspicion that there was something seriously wrong somewhere in the codebase. Rather than fix the symptom, I would try to track down and eliminate the disease.

    Could you give more of a description of why this is necessary? Perhaps we could suggest an alternative solution.

Re: Iterating hashes safely and efficiently
by dragonchild (Archbishop) on Aug 11, 2003 at 14:21 UTC
    Just out of curiosity, why would you have an iterator over a hash, then pass that hash into a function to iterate over again? That sounds like a poor design to me ...

    ------
    We are the carpenters and bricklayers of the Information Age.

    The idea is a little like C++ templates, except not quite so brain-meltingly complicated. -- TheDamian, Exegesis 6

    Please remember that I'm crufty and crochety. All opinions are purely mine and all code is untested, unless otherwise specified.

      To satisfy your curiosity: we are talking about a Tk application here. The hash in question is mainline-scoped, and the flow of execution is quite tortuous.

      I know it may be possible to redesign the app to use objects passed in by reference to the callbacks, but this would be a substantial piece of work.