in reply to Re^5: 64-bit digest algorithms
in thread 64-bit digest algorithms
Disclaimer 1: I've no idea who you are or what your expertise level is. You could even be Knuth himself, though the fact that his 'reference' hashing algorithm turns out to have pretty abysmal performance shows that even the very best can make errors.
Disclaimer 2: I'm not a mathematician, which is why I seek out the most authoritative reference material I can find. And the references I gave are it.
That said, I'm going to challenge the effectiveness of your test methodology.
By feeding your tests with 8..12-character strings drawn from the full range of byte values (0..255), you are creating ideal conditions for the algorithms to produce good results.
With randomly chosen bytes( 0..255 ), you have ensured that every bit (0..7) of the input bytes has an exactly even chance of being set. I.e.
    for my $n ( 0 .. 255 ) { vec( chr( $n ), $_, 1 ) and $h{ $_ }++ for 0 .. 7 };;
    print "$_ : $h{ $_ }" for 0 .. 7;;

    0 : 128  1 : 128  2 : 128  3 : 128  4 : 128  5 : 128  6 : 128  7 : 128
In fact, if you used 4-byte strings of bytes( rand( 0..255 ) ), no algorithm would be necessary, because just treating those 4 bytes as an integer forms a 32-bit "perfect hash". Increasing the length to 8 or 12 bytes does nothing to alter the probabilities; they remain even across the board, assuming infinite trials.
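To make that concrete, a minimal sketch (illustration only, not part of the tests above):

    # Four random bytes, reinterpreted as an integer, are already a
    # uniformly distributed 32-bit value; no hashing is required.
    my $key  = pack 'C4', map { int rand 256 } 1 .. 4;
    my $hash = unpack 'N', $key;    # the same 4 bytes viewed as a 32-bit integer
    printf "key = %vd   hash = %u\n", $key, $hash;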
Adding 9-, 10-, and 11-character strings of bytes( rand( 0 .. 255 ) ) skews the mix a little, but the skewing is probably lost in the noise. With an average of 10-char strings of 0..255, the total population is 256^10 = 1208925819614629174706176, which means that even your first test (where the sample set was 2^20) is only sampling 8.6736173798840354720596224069595e-17% of the total population. That's just 0.000000000000000086%!
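Those fractions are easy enough to check for both your sample sizes (2^20 and 16384); plain floating point is plenty accurate for a percentage:

    # Back-of-the-envelope check of the sampling fractions quoted above.
    my $population = 256 ** 10;    # every possible 10-byte string: ~1.2e24
    printf "%u samples = %.2e%% of the population\n", $_, 100 * $_ / $population
        for 2 ** 20, 16384;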
I think if pollsters used such a small sample, their obligatory small-print caveat of +/-3% error would have to be more like +/-99.9% error. For the other two tests, where the sample size is just 16384, that drops to just 0.0000000000000000013%. Even Monte Carlo simulations require a far higher sample size than this to achieve accuracy.
For example, 10e6 (ten million) samples of the Monte Carlo approximation of PI don't achieve any great accuracy:
    $n=0; rand()**2 + rand()**2 <= 1 and ++$n, printf "\r$_: %.10f", 4*( $n / $_ ) for 1 .. 10e6;;

    10000000: 3.1419312000
And that is a 5.4210108624275221700372640043497e-11% sample of the population size(*), which is several orders of magnitude higher than yours above.
*Using the 32-bit Math::Random::MT::rand() to pick points in a 1.0x1.0 square gives a finite range of possibilities: 2^32 x 2^32 = 2^64; 10e6 / 2^64 * 100 == 5.4e-11%.
The problems of hashing algorithms producing high rates of collisions arise when their failure to produce good dispersal (due to funnelling) is exacerbated by input that does not provide a wide range of bits. I.e. when the input strings consist entirely of some restricted subset of their full range of possibilities.
For example, when the keys being hashed consist entirely of upper-case letters, or just numeric digits. Try performing your test 1 with strings that range from '0' .. '99999999' and see what results you get.
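To see why that matters, re-run the earlier bit-frequency one-liner over just the digit characters (a quick sketch); the bits are nowhere near evenly distributed any more:

    # Same bit-frequency count as before, but for digit characters only.
    # Two bits are always set, two are never set, and the rest are skewed.
    my %h;
    for my $c ( '0' .. '9' ) {
        vec( $c, $_, 1 ) and $h{ $_ }++ for 0 .. 7;
    }
    print "$_ : ", $h{ $_ } || 0, "\n" for 0 .. 7;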
And to achieve some level of statistical confidence in the results, sample a much higher proportion of the inputs. Using 1-8 character strings of just digits, you should be able to exercise the entire population quite quickly.
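Something along these lines would do; a rough sketch only (the hash function is passed in as a code ref, and the result is folded into a small number of buckets so that collisions show up quickly):

    # Feed every digit string of 1 .. $max_len characters (leading zeros
    # included) to the supplied hash function, fold the result into
    # 2**$bits buckets, and report how many buckets hold more than one key.
    sub collision_test {
        my( $hash_fn, $max_len, $bits ) = @_;
        my %buckets;
        my $keys = 0;
        for my $len ( 1 .. $max_len ) {
            for my $n ( 0 .. 10 ** $len - 1 ) {
                my $key = sprintf "%0${len}d", $n;
                $buckets{ $hash_fn->( $key ) % ( 2 ** $bits ) }++;
                ++$keys;
            }
        }
        my $collided = grep { $_ > 1 } values %buckets;
        printf "%d keys: %d of %d occupied buckets hold more than one key\n",
            $keys, $collided, scalar keys %buckets;
    }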
You should also favour the FNV-1a algorithm over FNV-1, which is known to be less good with short inputs. (The only difference is doing the XOR before the multiply rather than after!)
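For reference, a 64-bit FNV-1a sketch (Math::BigInt is used purely to keep the 64-bit wrap-around portable; it is slow, but the ordering of the operations is the point):

    use Math::BigInt;

    # 64-bit FNV-1a: XOR each byte in first, then multiply by the FNV prime,
    # keeping only the low 64 bits.
    sub fnv1a_64 {
        my $hash  = Math::BigInt->new( '14695981039346656037' );   # offset basis
        my $prime = Math::BigInt->new( '1099511628211' );          # FNV prime
        my $mask  = Math::BigInt->new( 2 )->bpow( 64 )->bsub( 1 );
        for my $byte ( unpack 'C*', $_[0] ) {
            $hash->bxor( $byte );                   # XOR first (the '1a' part) ...
            $hash->bmul( $prime )->band( $mask );   # ... then multiply, mod 2**64
        }
        return $hash;
    }

Plugged into the harness sketched above, collision_test( \&fnv1a_64, 4, 16 ); makes for a quick (if slow, because of the BigInt arithmetic) sanity check before scaling up to the full 8-digit population with a faster implementation.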
For completeness, add Jenkins' lookup3 algorithm to your tests.