in reply to Re: Re: Re: Slowness when inserting into pre-extended array
in thread Slowness when inserting into pre-extended array

If there was a trap, it was the trap of not reading the question carefully and zeroing in on the limit mentioned when I shouldn't have. I don't want to spread any myths, which is why I'm glad you're correcting me.

As for sounding authoritative: I'm mostly just older than many of my fellow Perl monks, but definitely not more experienced in the use of Perl. So:

"Hashes are OK for any number of keys until you run out of memory"

Even if that weren't the case, is there a better way to do it (without resorting to something like Judy)?

However, I think you're wrong with respect to point 3. On the one hand you say that the number of keys per bucket stays in the same range, yet on the other there is supposed to be a "worst case" scenario. If Perl is always able to keep the number of keys per bucket roughly the same, how can there be a worst case?

Liz


Re: Re: Re: Re: Re: Slowness when inserting into pre-extended array
by tilly (Archbishop) on Jul 19, 2003 at 23:50 UTC
    The aspect of "authoritative" that I was talking about was that you gave enough technical information, of a kind that is hard for others to judge, that it looked like you must know a lot about what you were talking about. It is independent of age. Of course I am doing the same, throwing out lots of technical information, but hopefully I actually understand it and will be able to answer any follow-up questions until you are comfortable that you understand it as well.

    As for what is better than hashing: define "better". For key/value pairs in a flat memory model, hashing gives the best average performance of any simple data structure that I know of. That is why people use it. If an obviously superior alternative existed, then everyone would use that and nobody would have heard of hashing. But hashing has bad worst-case performance, doesn't keep things in order (often a requirement), and blows your cache badly. If you care about worst-case performance, about having things come out in order, or about memory cache effects, then alternatives can beat hashing. (My understanding from their docs is that Judy arrays can beat hashes because they do a lot of work to pay attention to memory cache effects. Similarly, for large data structures backed on disk, a BTREE beats hashing because you hit the disk far less.)
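
    To make the "keeps things in order" tradeoff concrete, here is a minimal sketch comparing a plain hash with a BTREE-tied hash. It assumes the DB_File module (and the Berkeley DB library underneath it) is available, which is common but not universal; the module choice and the sample data are just mine for illustration.

        #!/usr/bin/perl
        use strict;
        use warnings;
        use DB_File;   # needs the Berkeley DB library underneath
        use Fcntl;     # for O_RDWR and O_CREAT

        # A plain hash: iteration order is whatever the buckets happen to give you.
        my %plain = (banana => 2, apple => 1, cherry => 3);
        print "plain hash: ", join(" ", keys %plain), "\n";

        # A BTREE-tied hash: keys always come back in sorted order, at the cost
        # of O(log n) rather than O(1) lookups.  Passing undef as the filename
        # keeps the tree in memory instead of on disk.
        tie my %ordered, 'DB_File', undef, O_RDWR|O_CREAT, 0666, $DB_BTREE
            or die "cannot tie: $!";
        %ordered = %plain;
        print "BTREE hash: ", join(" ", keys %ordered), "\n";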

    And finally, I assure you that I am not wrong about point 3. A hashing algorithm that is working well appears to throw keys into buckets randomly. The resulting distribution of the number of keys per bucket is a Poisson distribution, with lambda depending on the ratio of keys to buckets. If you add too many keys, Perl increases the number of buckets so that lambda stays in a fixed range.
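
    If you want to see that Poisson behaviour for yourself, here is a toy simulation. It only pretends to be a hash by throwing keys into buckets with rand(); it is not Perl's actual hashing code, and the bucket and key counts are simply numbers I picked so that lambda works out to 1.

        #!/usr/bin/perl
        use strict;
        use warnings;

        # Throw $keys keys "randomly" into $buckets buckets and tally how many
        # buckets end up holding 0, 1, 2, ... keys.  With lambda = keys/buckets = 1,
        # a Poisson distribution predicts roughly 37% empty buckets, 37% with one
        # key, 18% with two, 6% with three, and a rapidly vanishing tail.
        my $buckets = 8192;
        my $keys    = 8192;

        my @count = (0) x $buckets;
        $count[ int rand $buckets ]++ for 1 .. $keys;

        my %histogram;
        $histogram{$_}++ for @count;

        for my $n (sort { $a <=> $b } keys %histogram) {
            printf "%d key(s) per bucket: %5d buckets (%4.1f%%)\n",
                $n, $histogram{$n}, 100 * $histogram{$n} / $buckets;
        }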

    The fly in the ointment is the phrase "appears to throw keys into buckets randomly". As we all know, computers don't do "random" very well, and besides, when we see the same key later we had better end up looking in the same bucket. There is therefore no way to guarantee that you won't wind up with lots of keys in one bucket and the rest of the buckets empty. The odds against this happening by accident are astronomical, but it is not impossible.

    And that is the worst-case scenario: the hashing algorithm throws all of the keys into one bucket. That happens (almost) never by accident, but a malicious attacker can try to make it happen intentionally. And if someone does that, then it does Perl absolutely no good to provide the hash with a million buckets to use: everything goes into just one, and the well-designed hash implementation degrades into scanning one huge linked list.
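
    Here is a toy illustration of that degradation. The "hash function" below is deliberately weak (just the sum of the character codes, which is nothing like Perl's real hash function), so it is easy to craft keys that all collide; the point is only to show how a predictable hash function lets someone force everything into one bucket.

        #!/usr/bin/perl
        use strict;
        use warnings;

        # Toy stand-in for a hash function (NOT Perl's real one): the bucket is
        # just the sum of the character codes modulo the number of buckets.
        my $buckets = 1024;
        sub toy_bucket {
            my ($key) = @_;
            my $sum = 0;
            $sum += ord($_) for split //, $key;
            return $sum % $buckets;
        }

        # Anyone who knows that rule can craft colliding keys: every two-letter
        # key whose character codes sum to 219 ('az', 'za', 'by', 'yb', ...)
        # lands in exactly the same bucket.
        my @evil;
        for my $x ('a' .. 'z') {
            for my $y ('a' .. 'z') {
                push @evil, "$x$y" if ord($x) + ord($y) == 219;
            }
        }

        my %seen;
        $seen{ toy_bucket($_) }++ for @evil;
        printf "%d crafted keys, %d distinct bucket(s) used\n",
            scalar @evil, scalar keys %seen;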

    Hence the defence of randomizing the hashing algorithm itself so that nobody can predict what hashing algorithm they are dealing with, and therefore what set of keys will give it problems...
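
    As a sketch of what that randomization buys you, here is the toy example again, but with a per-process random seed folded into a slightly better mixing function (loosely in the spirit of a one-at-a-time hash, and again nothing like Perl's actual code). The keys crafted against the unsalted toy hash no longer land in one bucket, and because the seed changes on every run, it becomes much harder for an attacker to work out ahead of time a set of keys that will collide.

        #!/usr/bin/perl
        use strict;
        use warnings;

        # Toy salted hash (NOT Perl's real one): a random per-process $seed is
        # mixed into every key, so which keys collide depends on a value an
        # attacker never gets to see.
        my $buckets = 1024;
        my $seed    = int rand 2**24;

        sub seeded_bucket {
            my ($key) = @_;
            my $h = $seed;
            for my $c (split //, $key) {
                $h = ($h + ord $c)     & 0xFFFFFF;   # add the character
                $h = ($h + ($h << 10)) & 0xFFFFFF;   # smear the bits upward
                $h ^= $h >> 6;                       # and mix them back down
            }
            return $h % $buckets;
        }

        # The two-letter keys whose character codes sum to 219, i.e. the ones
        # that all collided under the unsalted "sum of characters" toy hash.
        my @evil;
        for my $x ('a' .. 'z') {
            for my $y ('a' .. 'z') {
                push @evil, "$x$y" if ord($x) + ord($y) == 219;
            }
        }

        my %seen;
        $seen{ seeded_bucket($_) }++ for @evil;
        printf "with a random seed, %d crafted keys spread over %d bucket(s)\n",
            scalar @evil, scalar keys %seen;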

Re: Re: Re: Re: Re: Slowness when inserting into pre-extended array
by Elian (Parson) on Jul 20, 2003 at 15:18 UTC
    "Hashes are OK for any number of keys until you run out of memory"
    That's not actually universally true: large numbers of hash keys trigger pathological behaviour in many versions of glibc. The bug has been fixed, but a lot of those systems are still out there because people are understandably loath to update such a fundamental part of their system.