For one bucket of x elements,
1st requires 1 comparison
2nd requires 2 comparisons
3rd requires 3 comparisons
...
xth requires x comparisons
Total comparisons
= x + x-1 + ... + 2 + 1
= x(x+1)/2
If there are k buckets of n/k elements,
Total comparisons
= k * (n/k) * ((n/k)+1) / 2
= n(n + k)/2k
Oh, I spot what's wrong here; it should be:
X1(X1+1)/2 + ... + Xk(Xk+1)/2 = n/2 + (X1^2 + X2^2 + ... + Xk^2)/2
Let (1) denote the above expression.
Now I've no idea how to calculate the expected value of (1)....
Interesting. I think there is a fundamental problem here, arising from "For a random hash of <n> keys into <k> buckets, ...". Presuming we can even define what random input to the hash algorithm means, the hash function will not generate random indices.
There is a lot of tuning involved in picking a hash algorithm, and a lot of just plain running what are supposed to be realistic test cases.
To the best of my knowledge, the Perl hash algorithm is:
unsigned int
old_perl_hash(const char *s)        /* s: pointer to a NUL-terminated key */
{
    unsigned int hash = 0;
    while (*s)
        hash = hash * 33 + (unsigned char)*s++;  /* cast: plain char may be signed */
    return hash;
}
As it turns out, this algorithm is very efficient to compute on Intel processors (which is definitely part of any quality benchmark). If you try to replace that integer multiply with some shifts, the performance will just dive into the dirt, because on a modern Intel processor an integer multiply costs essentially the same as an integer add. I was stunned when I found that out.
I have a C app where I changed the hash key generation algorithm from the Perl one to a different one based upon empirical test data. Hash algorithms appear to be very much an empirical art.
Anyway, there are a lot of variables.
I'm skeptical that a single Q (quality) factor can be derived for a particular algorithm independent of a bunch of assumptions about the input data.
Update: From what I remember, the Perl hash algorithm will double the number of hash indices when the total number of things in the hash exceeds the total number of "starting slots"/2. There is an erratum to "Advanced Perl Programming", page 340, that points this out. This "when do we get bigger?" question is also very important!
/* hash a key */
/* FYI: This is the "One-at-a-Time" algorithm by Bob Jenkins
 * from requirements by Colin Plumb.
 * (http://burtleburtle.net/bob/hash/doobs.html) */
/* The use of a temporary pointer and the casting games
 * is needed to serve the dual purposes of
 * (a) the hashed data being interpreted as "unsigned char" (new since 5.8,
 *     a "char" can be either signed or unsigned, depending on the compiler)
 * (b) catering for old code that uses a "char"
 *
 * The "hash seed" feature was added in Perl 5.8.1 to perturb the results
 * to avoid "algorithmic complexity attacks".
 *
 * If USE_HASH_SEED is defined, hash randomisation is done by default
 * If USE_HASH_SEED_EXPLICIT is defined, hash randomisation is done
 * only if the environment variable PERL_HASH_SEED is set.
 * For maximal control, one can define PERL_HASH_SEED.
 * (see also perl.c:perl_parse()).
 */
#ifndef PERL_HASH_SEED
#   if defined(USE_HASH_SEED) || defined(USE_HASH_SEED_EXPLICIT)
#       define PERL_HASH_SEED PL_hash_seed
#   else
#       define PERL_HASH_SEED 0
#   endif
#endif

#define PERL_HASH(hash,str,len) PERL_HASH_INTERNAL_(hash,str,len,0)

/* Only hv.c and mod_perl should be doing this. */
#ifdef PERL_HASH_INTERNAL_ACCESS
#define PERL_HASH_INTERNAL(hash,str,len) PERL_HASH_INTERNAL_(hash,str,len,1)
#endif

/* Common base for PERL_HASH and PERL_HASH_INTERNAL that parameterises
 * the source of the seed. Not for direct use outside of hv.c. */
#define PERL_HASH_INTERNAL_(hash,str,len,internal) \
    STMT_START { \
        register const char * const s_PeRlHaSh_tmp = str; \
        register const unsigned char *s_PeRlHaSh = \
            (const unsigned char *)s_PeRlHaSh_tmp; \
        register I32 i_PeRlHaSh = len; \
        register U32 hash_PeRlHaSh = \
            (internal ? PL_rehash_seed : PERL_HASH_SEED); \
        while (i_PeRlHaSh--) { \
            hash_PeRlHaSh += *s_PeRlHaSh++; \
            hash_PeRlHaSh += (hash_PeRlHaSh << 10); \
            hash_PeRlHaSh ^= (hash_PeRlHaSh >> 6); \
        } \
        hash_PeRlHaSh += (hash_PeRlHaSh << 3); \
        hash_PeRlHaSh ^= (hash_PeRlHaSh >> 11); \
        (hash) = (hash_PeRlHaSh + (hash_PeRlHaSh << 15)); \
    } STMT_END
I think there is a fundamental problem here arising from "For a random hash of <n> keys into <k> buckets, ...". Presuming we can even define what random input means to the hash algorithm, the hash function will not generate random indices.
I'll grant you the terminology is poor, but I don't see how they could possibly be referring to anything but an optimally balanced hash, one with the same number of elements in each bucket.
If the formula were really the result of sampling random hashes, it wouldn't have been derived from "the sum of the squares of the number of entries in each bucket."
I think I understand you and I don't see any big problem.
It turns out the perldoc is correct;
a friend of mine gets the same result by rigorous mathematical calculation.