Disclaimer 1: I've no idea who you are or what your expertise level is. You could even be Knuth himself, though the fact that his 'reference' hashing algorithm turns out to have pretty abysmal performance shows that even the very best can make errors.

Disclaimer 2: I'm not a mathematician, which is why I seek out the most authoritative reference material I can find. And the references I gave are it.

That said, I'm going to challenge the effectiveness of your test methodology.

By feeding your tests 8..12-character strings drawn from the full range of byte values (0..255), you are creating ideal conditions for the algorithms to produce good results.

With randomly chosen bytes (0..255), you have ensured that every bit (0..7) of the input bytes has a statistically even chance of being set, with exactly even odds for the frequencies of all bits. I.e.

for my $n ( 0 .. 255 ) { vec( chr( $n ), $_, 1 ) and $h{ $_ }++ for 0 .. 7 };;
print "$_ : $h{ $_ }" for 0 .. 7;;
0 : 128  1 : 128  2 : 128  3 : 128  4 : 128  5 : 128  6 : 128  7 : 128

In fact, if you used 4-byte strings of bytes( rand( 0..255 ) ), no algorithm is necessary: just treating those 4 bytes as an integer forms a 32-bit "perfect hash". Increasing the length to 8 or 12 bytes does nothing to alter the probabilities; they remain even across the board, assuming infinite trials.
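A minimal sketch of that 4-byte case (the key value here is just for illustration):

```perl
# A 4-byte string unpacked as an unsigned 32-bit big-endian integer is
# its own hash: distinct inputs map to distinct 32-bit values, i.e. a
# "perfect hash" over the 4-byte keyspace.
my $key  = "\x12\x34\x56\x78";
my $hash = unpack 'N', $key;
printf "%u\n", $hash;    # prints 305419896 (0x12345678)
```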

Adding 9-, 10-, and 11-character strings of bytes( rand( 0 .. 255 ) ) skews the mix a little, but the skewing is probably lost in the noise. With an average of 10-char strings of 0..255, the total population is 256^10 = 1208925819614629174706176, which means that even your first test (where the sample set was 2^20) is only sampling 8.6736173798840354720596224069595e-17% of the total population. That's just 0.000000000000000086%!
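For the record, the sampling fraction is easily computed (these powers of two are represented exactly by Perl's doubles):

```perl
# percentage of the 256**10 keyspace covered by a 2**20 sample
printf "%.3g%%\n", 2**20 / 256**10 * 100;    # prints 8.67e-17%
```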

I think if pollsters used such a small sample, their obligatory small-print caveat of +/-3% error would have to be more like +/-99.9% error. For the other two tests, where the sample size is just 16384, that drops to just 0.0000000000000000013%. Even Monte Carlo simulations require a far higher sample size than this to achieve accuracy.

For example, 10e6 samples of the Monte Carlo approximation of PI don't achieve any great accuracy:

$n=0; rand()**2+rand()**2 <=1 and ++$n, printf "\r$_: %.10f", 4*( $n / $_ ) for 1 .. 10e6;;
10000000: 3.1419312000

And that is a 5.4210108624275221700372640043497e-11% sample of the population size(*), which is several orders of magnitude higher than yours above.

*Using the 32-bit Math::Random::MT::rand() to pick points in a 1.0x1.0 square gives a finite range of possibilities: 2^32x2^32 = 2^64; 10e6 / 2^64 * 100 == 5.4e-11%.

The problems with hashing algorithms producing high rates of collisions arise when their failure to produce good dispersal (due to funnelling) is exacerbated by input that does not provide a wide range of bits. I.e. when the input strings consist entirely of some restricted subset of their full range of possibilities.

For example, when the keys being hashed consist entirely of uppercase letters, or just numeric digits. Try performing your test 1 with strings that range from '0' .. '99999999' and see what results you get.

And to achieve some level of statistical confidence in the results, sample a much higher proportion of the inputs. Using strings of 0-8 digit characters only, you should be able to exercise the entire population quite quickly.
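A sketch of such a restricted-input test (the sum_hash() here is a deliberately poor, hypothetical placeholder; substitute whichever algorithm you are measuring):

```perl
use strict; use warnings;

# Deliberately poor placeholder hash: just sums the byte values.
# Swap in the real hash function under test.
sub sum_hash {
    my $h = 0;
    $h += $_ for unpack 'C*', shift;
    return $h;
}

# Hash the all-digit keys '0' .. '99999' into 2**16 buckets and count
# how many keys land in an already-occupied bucket.
my %seen;
my $collisions = 0;
for my $key ( 0 .. 99999 ) {
    my $bucket = sum_hash( "$key" ) % 2**16;
    $collisions++ if $seen{ $bucket }++;
}
print "$collisions collisions from 100000 keys\n";
```

With digit-only keys the byte sums span only a couple of hundred distinct values, so this placeholder collides on virtually every key; a well-dispersing hash should stay close to the birthday-bound expectation for 100000 keys in 65536 buckets instead.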

You should also favour the FNV-1a algorithm over FNV-1, which is known to be weaker with short inputs. (The only difference is doing the XOR before the multiply rather than after!)
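The two inner loops side by side; a minimal 32-bit sketch assuming a 64-bit Perl (so the intermediate multiply stays exact), using the standard FNV offset basis and prime:

```perl
use strict; use warnings;

my $FNV_PRIME  = 16777619;      # 0x01000193
my $FNV_OFFSET = 2166136261;    # 0x811c9dc5

# FNV-1: multiply first, then XOR in the byte.
sub fnv1 {
    my $h = $FNV_OFFSET;
    $h = ( ( $h * $FNV_PRIME ) & 0xffffffff ) ^ $_ for unpack 'C*', shift;
    return $h;
}

# FNV-1a: XOR in the byte first, then multiply.
sub fnv1a {
    my $h = $FNV_OFFSET;
    $h = ( ( $h ^ $_ ) * $FNV_PRIME ) & 0xffffffff for unpack 'C*', shift;
    return $h;
}

printf "FNV-1:  %08x\n", fnv1( 'a' );     # 050c5d7e
printf "FNV-1a: %08x\n", fnv1a( 'a' );    # e40c292c
```

Both results match the published 32-bit FNV test vectors for the single byte 'a'.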

For completeness, add Jenkins' lookup3 algorithm to your tests.


Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
"Too many [] have been sedated by an oppressive environment of political correctness and risk aversion."

In reply to Re^6: 64-bit digest algorithms by BrowserUk
in thread 64-bit digest algorithms by BrowserUk
