Using a hash for an ordered set can make sense

Beginners are understimating the hashes utility if not ignoring their mere existence. Exprimented users know when to choose an hash or an array. I am trying here to formalize the choice process.

We have a set of elements. Should we shove it in an array or a hash? If the set is unordered, it is clear that the choice is to use a hash:
$hash{$_}++ for qw( one_element another still_another foo foo );
If we have duplicate elements we get almost for free as associated values the number of duplicates for a given element. Note that we don't have lost the ability to iterate over the set. In the following iteration, duplicates are made consecutive.
print "$_\n" for map { my $i=$_; map { $i } 1.. $hash{$_} } keys %hash

If the set is ordered, it may be a stack or a queue; or a static list which elements are accessed by rank. In Perl all of those are stored in arrays. We don't have to remember fancy classes names here unlike the OO bondage languages.

Now the interesting case... and less intuitive. The set has no duplicates and is ordered by some (possibly costly) function of the elements values and we are adding and supressing elements all the time. Because it is ordered, we are tempted to use an array; but insertion and deletion in a an array are complex and costly (splices), We may think of a linked list, but really the natural choice is usually the hash (or 2 hashes). Let's call val the ordering function.

We start by the case of one hash
Note that if we don't have an explicit ordering function, the order is just the the elements appearance order: my $i=0; $hash{$_}=$i++ for qw( one_element another still_another foo );

Adding an element: $hash{$elt}= val $elt
Suppressing an element: delete $hash{$elt}
Iterating with order: print "$_\n" for sort { $hash{$a} <=> $hash{$b} } keys %hash;

Because we have a one to one correspondance between keys and values, we could use a an additional hash: a "reverse" hash %rev. It has the property: $hash{$rev{$_}} == $_ == $rev{$hash{$_}}
Iterating would be cheap: print "$rev{$_}\n" for sort keys %rev; But the price to pay is to maintain two hashes instead of one when adding or deleting elements. So depending if we spend more time iterating over element or adding and deleting elements, we should use respectively two or one hash. Space consideration may also a concern if using two hashes. In both cases, each iteration incurs the cost of a sort: n log2 n

So if adding and deleting an element is really the rare case compared to iterating, than you are better using an array and the joice of splicing :(

Sure enough, when we want to more operations than adding, deleting and iterating, richer data structures must be considered.

This reflexion was triggered by On Removing Elements from Arrays

Corrected to suppress the unneeded Schwartzian Transform as suggested below by merlyn below. The ordering function is already cached as value associated to key. The whole point of an ST is to avoid to recalculate that function over and over. The correction does not change the validity of the points made in that node. Thanks merlyn

-- stefp

Comment on Using a hash for an ordered set can make sense Select or Download Code

Replies are listed 'Best First'.
Re: Using a hash for an ordered set can make sense by merlyn (Sage) on Sep 23, 2001 at 09:03 UTC
`print "$_\n" ofr map { $_->[0] } sort { $a->[1] <=> $b->[1] } map { [ $_, $hash{$_} ] } keys %hash;` [download] This is an erroneous application of the ST. The ST should be used only when the ordering function is expensive for each element. Hash lookup is not expensive. I'd rewrite this as a simple defined sort: `print "$_\n" for sort { $hash{$a} <=> $hash{$b} } keys %hash;` [download] Easier to type, faster, and clearer. The trifecta! -- Randal L. Schwartz, Perl hacker	[reply] [d/l] [select]
Re (tilly) 1: Using a hash for an ordered set can make sense by tilly (Archbishop) on Sep 23, 2001 at 22:33 UTC
In my experience the majority of programs which run into performance problems on time spent inserting and deleting items into arrays can be more easily fixed by doing inserts and deletes as bulk operations, using grep or map when you can, and when you can't then incrementally building up a new array with repeated calls to push. While it looks like a lot of work to build a new array from scratch, it is a lot less work than it is to do a lot of manual manipulations of the array, each of which involve moving around large chunks of the existing array. This observation will allow you to fix the vast majority of performance problems arising from having to do a lot of detail manipulation of arrays. But there are a very small number of programs which simply must do a lot of inserting and deleting which cannot be fixed. The odds are very small that you have one of those. This is so even if you don't see offhand how to fix the program to avoid the issue. (It takes some experience to find the fixes.) But for that infinitesmal remainder, I would recommend that you tie the array to a data structure (eg DB_File's BTREE implementation) that has slower accessing but allows inserts and deletes to be done much faster. Voila! Works like magic! (In fact magic what does makes tie work behind the scenes...)	[reply]
Re: Using a hash for an ordered set can make sense by dragonchild (Archbishop) on Sep 24, 2001 at 17:14 UTC
I'd like to address the assertion that for forward iterating you need one hash and for reverse iteration, you need another. This seems to me like you would have a hash of hashes, built as such: `my %datastructure = ( forward_iteration => { }, reverse_iteration => { }, );` [download] This seems wasteful to me. Instead, why not have something like `my %datastructure = { elem1 => { forward => 1, reverse => 2, }, elem2 => { forward => 2, reverse => 1, }, };` [download] This way, if you want to change your code to add a new ordering mechanism (which happens more often that I'd like), you just add another value to each element's hashref of orderings. And, building the ordered-hash is more obvious, because you're doing something like `$datastructures{$elem} = { forward => forward_val($elem), reverse => reverse_val($elem), other_way => other_way_val($elem), }` [download] Just a thought. :-) ------ We are the carpenters and bricklayers of the Information Age. Don't go borrowing trouble. For programmers, this means Worry only about what you need to implement.	[reply] [d/l] [select]