For one bucket of x elements,
1st requires 1 comparison
2nd requires 2 comparisons
3rd requires 3 comparisons
...
xth requires x comparisons
Total comparisons
= x + x-1 + ... + 2 + 1
= x(x+1)/2
If there are k buckets of n/k elements,
Total comparisons
= k * (n/k) * ((n/k)+1) / 2
= n(n + k)/2k
Oh, I spot what's wrong here; it should be:
X1(X1+1)/2 + ... + Xk(Xk+1)/2 = n/2 + (X1^2 + X2^2 + ... + Xk^2)/2
Let (1) denote the above expression.
Now I've no idea how to calculate the expected value of (1)....
Interesting. I think there is a fundamental problem here, arising from "For a random hash of <n> keys into <k> buckets, ...". Presuming we can even define what random input to the hash algorithm means, the hash function will not generate random indices.
There is a lot of tuning involved in picking a hash algorithm, and a lot of just plain running what are supposed to be realistic test cases.
To the best of my knowledge, the Perl hash algorithm is:
unsigned int
old_perl_hash(const char *s)        /* s: pointer to a NUL-terminated key */
{
    unsigned int hash = 0;
    while (*s)
        hash = hash * 33 + (unsigned char)*s++;  /* cast: plain char may be signed */
    return hash;
}
As it turns out, this algorithm is very efficient to compute on Intel processors (which is definitely part of any quality benchmark). If you try to replace that integer multiply with some shifts, the performance will just dive into the dirt, because on a modern Intel processor an integer multiply costs essentially the same as an integer add. I was stunned when I found that out.
I have a C app where I changed the hash key generation algorithm from the Perl one to a different one based upon empirical test data. Hash algorithms appear to be very much an empirical art.
Anyway, there are a lot of variables.
I'm skeptical that a single Q (quality) factor can be derived for a particular algorithm independent of a bunch of assumptions about the input data.
Update: From what I remember, the Perl hash algorithm will double the number of hash indices when the total number of things in the hash exceeds the total number of "starting slots"/2. There is an erratum to "Advanced Perl Programming", page 340, that points this out. This "when do we get bigger?" question is also very important!
/* hash a key */
/* FYI: This is the "One-at-a-Time" algorithm by Bob Jenkins
 * from requirements by Colin Plumb.
 * (http://burtleburtle.net/bob/hash/doobs.html) */
/* The use of a temporary pointer and the casting games
 * is needed to serve the dual purposes of
 * (a) the hashed data being interpreted as "unsigned char" (new since 5.8,
 *     a "char" can be either signed or unsigned, depending on the compiler)
 * (b) catering for old code that uses a "char"
 *
 * The "hash seed" feature was added in Perl 5.8.1 to perturb the results
 * to avoid "algorithmic complexity attacks".
 *
 * If USE_HASH_SEED is defined, hash randomisation is done by default
 * If USE_HASH_SEED_EXPLICIT is defined, hash randomisation is done
 * only if the environment variable PERL_HASH_SEED is set.
 * For maximal control, one can define PERL_HASH_SEED.
 * (see also perl.c:perl_parse()).
 */
#ifndef PERL_HASH_SEED
#   if defined(USE_HASH_SEED) || defined(USE_HASH_SEED_EXPLICIT)
#       define PERL_HASH_SEED PL_hash_seed
#   else
#       define PERL_HASH_SEED 0
#   endif
#endif

#define PERL_HASH(hash,str,len) PERL_HASH_INTERNAL_(hash,str,len,0)

/* Only hv.c and mod_perl should be doing this. */
#ifdef PERL_HASH_INTERNAL_ACCESS
#define PERL_HASH_INTERNAL(hash,str,len) PERL_HASH_INTERNAL_(hash,str,len,1)
#endif

/* Common base for PERL_HASH and PERL_HASH_INTERNAL that parameterises
 * the source of the seed. Not for direct use outside of hv.c. */
#define PERL_HASH_INTERNAL_(hash,str,len,internal) \
    STMT_START { \
        register const char * const s_PeRlHaSh_tmp = str; \
        register const unsigned char *s_PeRlHaSh = \
            (const unsigned char *)s_PeRlHaSh_tmp; \
        register I32 i_PeRlHaSh = len; \
        register U32 hash_PeRlHaSh = \
            (internal ? PL_rehash_seed : PERL_HASH_SEED); \
        while (i_PeRlHaSh--) { \
            hash_PeRlHaSh += *s_PeRlHaSh++; \
            hash_PeRlHaSh += (hash_PeRlHaSh << 10); \
            hash_PeRlHaSh ^= (hash_PeRlHaSh >> 6); \
        } \
        hash_PeRlHaSh += (hash_PeRlHaSh << 3); \
        hash_PeRlHaSh ^= (hash_PeRlHaSh >> 11); \
        (hash) = (hash_PeRlHaSh + (hash_PeRlHaSh << 15)); \
    } STMT_END
I think there is a fundamental problem here arising from "For a random hash of <n> keys into <k> buckets, ...". Presuming we can even define what random input means to the hash algorithm, the hash function will not generate random indices.
I'll grant you the terminology is poor, but I don't see how they could possibly be referring to anything but an optimally balanced hash, one with the same number of elements in each bucket.
If the formula were really the result of sampling random hashes, it wouldn't have been derived from "the sum of the squares of the number of entries in each bucket."
I think I understand you and I don't see any big problem.
It turns out the perldoc is correct;
a friend of mine gets the same result by rigorous mathematical calculation.