Re^4: better union of sets algorithm?

IMO what you are seeing there is NOT lookup time, but rather effects such as cache spoil and hash remapping. Every time the densitiy of the hash goes above a certain level the number of buckets in the hash are doubled and all of the elements are remapped. This has a cost that shows up in the type of benchamrk that you are doing. You would see most of this effect disappear by presizing the hash before loading it. The reduction in run time that is the consequence of cache spoil is basically impossible to eliminate. Basically once the hash gets over a certain size you are more or less guaranteed that a hash lookup will spoil cache, eliminating benefits that would be had were you able to store the whole hash inside of the cache.

Also its worth remembering that every time you use a unique key in _any_ hash _anywhere_ that key will be inserted into a second hash structure that is used to store the keys. This means that the second hash must undergo scaling as well. AFAIK, amortized the cost of hash storage or lookup is actually constant. BTW, I believe later perls incorporate a mechanism by which degenerate bucket mappings are avoided, from what I understand its unusual to get more than two elements to a bucket and more commonly you wont see more than one.

---
demerphq

Comment on Re^4: better union of sets algorithm?

Replies are listed 'Best First'.
Re^5: better union of sets algorithm? by Anonymous Monk on Mar 11, 2005 at 10:17 UTC
Also its worth remembering that every time you use a unique key in _any_ hash _anywhere_ that key will be inserted into a second hash structure that is used to store the keys. No, buckets are linked lists, not secondary hashes.	[reply]
Re^6: better union of sets algorithm? by demerphq (Chancellor) on Mar 11, 2005 at 10:30 UTC
By default build perl shares keys accross all hashes. That way when you have a zillion hashes with the same keys the keys arent duplicated in memory. To store all those keys it uses another hash that really holds the keys. So when you insert a new key into your hash its hash number is calculated and then that number is used to store into both hashes. Which means that hash has to grow as well. --- demerphq	[reply]