in reply to Re^5: 64-bit digest algorithms
in thread 64-bit digest algorithms

Disclaimer 1: I've no idea who you are or what your expertise level is. You could even be Knuth himself, though the fact that his 'reference' hashing algorithm turns out to have pretty abysmal performance shows that even the very best can make errors.

Disclaimer 2: I'm not a mathematician, which is why I seek out the most authoritative reference material I can find. And the references I gave are it.

That said, I'm going to challenge the effectiveness of your test methodology.

By feeding your tests with 8..12 character strings drawn from the full range of byte values (0..255), you are creating ideal conditions for the algorithms to produce their best possible results.

With randomly chosen bytes( 0..255 ), you have ensured that every bit (0..7) of the input bytes has an exactly even chance of being set, so the bit frequencies are uniform across the board. Ie.

for my $n ( 0 .. 255 ) { vec( chr( $n ), $_, 1 ) and $h{ $_ }++ for 0 .. 7 };;
print "$_ : $h{ $_ }" for 0 .. 7;;
0 : 128 1 : 128 2 : 128 3 : 128 4 : 128 5 : 128 6 : 128 7 : 128

In fact, if you used 4-byte strings of bytes( rand( 0..255 ) ), no algorithm is necessary, because just treating those 4 bytes as an integer forms a 32-bit "perfect hash". Increasing the length to 8 or 12 bytes does nothing to alter the probabilities; they remain even across the board, assuming infinite trials.
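To illustrate (a minimal sketch, not part of your tests): treating 4 random bytes as an unsigned 32-bit integer -- here via unpack 'N', i.e. big-endian, though any fixed byte order would do -- already gives a uniformly distributed 32-bit value with no hashing at all.

#!/usr/bin/perl
use strict;
use warnings;

# A 4-byte string of randomly chosen byte values 0..255.
my $key = join '', map { chr int rand 256 } 1 .. 4;

# Interpreting those 4 bytes as an unsigned 32-bit integer is already a
# "perfect" 32-bit hash: distinct keys give distinct values, and random
# keys give uniformly distributed values.
my $hash = unpack 'N', $key;

printf "key bytes = %s, hash = 0x%08X\n",
       join( ' ', map { sprintf '%02X', $_ } unpack 'C4', $key ),
       $hash;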

Adding 9-, 10-, and 11-character strings of bytes( rand( 0 .. 255 ) ) skews the mix a little, but the skewing is probably lost in the noise. With an average of 10-char strings of 0..255, the total population is 256^10 = 1208925819614629174706176, which means that even your first test (where the sample set was 2^20) is only sampling 8.6736173798840354720596224069595e-17% of the total population. That's just 0.000000000000000086%!
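(For anyone who wants to check those percentages, a throwaway sketch -- double-precision arithmetic is plenty for a rough comparison:)

#!/usr/bin/perl
use strict;
use warnings;

my $population = 256 ** 10;    # ~1.2089e24 possible 10-byte keys

for my $samples ( 2 ** 20, 16384 ) {
    printf "%8d samples = %.2e%% of the population\n",
           $samples, 100 * $samples / $population;
}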

I think if pollsters used such a small sample, their obligatory small-print caveat of +/-3% error would have to be more like +/-99.9% error. For the other two tests, where the sample size is just 16384, that drops to just 0.0000000000000000013%. Even Monte Carlo simulations require a far higher sample size than this to achieve accuracy.

For example, 10e6 samples of the Monte Carlo approximation of pi don't achieve any great accuracy:

$n=0; rand()**2+rand()**2 <=1 and ++$n, printf "\r$_: %.10f", 4*( $n / $_ ) for 1 .. 10e6;;
10000000: 3.1419312000

And that is a 5.4210108624275221700372640043497e-11% sample of the population size(*), which is still several orders of magnitude higher than your sampling rates above.

*Using the 32-bit Math::Random::MT::rand() to pick points in a 1.0x1.0 square gives a finite range of possibilities: 2^32 x 2^32 = 2^64; 10e6 / 2^64 * 100 == 5.4e-11%.

The problems with hashing algorithms producing high rates of collisions arise when their failure to produce good dispersal (due to funnelling) is exacerbated by input that does not exercise a wide range of bits. Ie. when the input strings consist entirely of some restricted subset of their full range of possibilities.

For example, when the keys being hashed consist entirely of upper case letters, or just numeric digits. Try performing your test 1 with strings that range from '0' .. '99999999' and see what results you get.
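Something along these lines would do it -- a rough sketch only: the fnv1a_32() below is a plain textbook 32-bit FNV-1a, not necessarily the implementation you are testing, and it assumes a 64-bit perl so the 32-bit multiply stays exact.

#!/usr/bin/perl
use strict;
use warnings;

# Textbook 32-bit FNV-1a (offset basis 2166136261, prime 16777619).
sub fnv1a_32 {
    my $h = 0x811C9DC5;
    for my $byte ( unpack 'C*', $_[0] ) {
        $h ^= $byte;
        $h = ( $h * 0x01000193 ) & 0xFFFFFFFF;
    }
    return $h;
}

# Hash purely numeric keys and count how many land on a 32-bit value
# already produced by an earlier key.  Push the upper limit towards
# 99_999_999 as far as memory allows.
my %seen;
my $collisions = 0;
for my $key ( 0 .. 999_999 ) {
    $collisions++ if $seen{ fnv1a_32( $key ) }++;
}
printf "%d collisions from %d keys\n", $collisions, 1_000_000;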

And to achieve some level of statistical confidence in the results, sample a much higher proportion of the inputs. Using just 0-8 character strings of just digits, you should be able to exercise the entire population quite quickly.

You should also favour the FNV-1a algorithm over FNV-1, which is known to be less good with short inputs. (That just means doing the XOR before the multiply rather than after!)
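In code, the difference is just the order of those two operations in the per-byte loop. A sketch of the 32-bit variants, using the standard FNV offset basis and prime and again assuming a 64-bit perl:

#!/usr/bin/perl
use strict;
use warnings;

# FNV-1: multiply by the FNV prime first, then XOR in the byte.
sub fnv1_32 {
    my $h = 0x811C9DC5;                          # 32-bit offset basis
    for my $byte ( unpack 'C*', $_[0] ) {
        $h = ( $h * 0x01000193 ) & 0xFFFFFFFF;   # multiply...
        $h ^= $byte;                             # ...then XOR
    }
    return $h;
}

# FNV-1a: XOR in the byte first, then multiply.
sub fnv1a_32 {
    my $h = 0x811C9DC5;
    for my $byte ( unpack 'C*', $_[0] ) {
        $h ^= $byte;                             # XOR first...
        $h = ( $h * 0x01000193 ) & 0xFFFFFFFF;   # ...then multiply
    }
    return $h;
}

printf "FNV-1  of 'foo' = 0x%08X\n", fnv1_32( 'foo' );
printf "FNV-1a of 'foo' = 0x%08X\n", fnv1a_32( 'foo' );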

For completeness, add Jenkins' lookup3 algorithm to your tests.


Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
"Too many [] have been sedated by an oppressive environment of political correctness and risk aversion."

Replies are listed 'Best First'.
Re^7: 64-bit digest algorithms
by gone2015 (Deacon) on Nov 22, 2008 at 02:48 UTC

    First: I am genuinely disappointed to see that your reference found CRC-32 such a bad hash -- apart from anything else, I've used it with (I believe !) some success. So, I'm intrigued and looking at this in the interests of learning something. Learn something new every day...

    Second: in any case, amongst my failings is a refusal to accept an explanation I don't understand. So I'm trying to understand.

    What I've tried, so far, is to step back from ideas of "avalanche" and "funnelling" and other stuff whose significance and effect I don't (yet) see how to quantify, and concentrate simply on measuring collisions. Simple-minded, but direct, I think.

    Now, it may be that there exists a perfect 32-bit hash function that somehow stirs up its inputs so that all parts of the input value affect all parts of the hash. So that one can then use some subset, or some "folding" of the hash to provide a shorter hash result. My gut feeling is that this is a very strong requirement, and requires a complex hash function.
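    (For concreteness, the "folding" I mean is simply XORing the two 16-bit halves of the 32-bit hash together -- in Perl terms:)

    # Fold a 32-bit hash value down to 16 bits by XORing the MS and LS
    # halves together.
    sub fold_16 {
        my $h32 = shift;
        return ( ( $h32 >> 16 ) ^ $h32 ) & 0xFFFF;
    }

    my $h16 = fold_16( 0xDEADBEEF );   # 0xDEAD ^ 0xBEEF == 0x6042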


    Anyway, I see you were not at all impressed by my initial testing against randomly constructed 8-12 byte strings of bytes 0x00..0xFF.

    The object of the exercise is to see if the hash function produces an even distribution (or better) of hash values, for the sorts of inputs we expect. It seemed to me that by starting with this random input I would, at least, discover if CRC-32 (or FNV) was bad -- if the input to the hash was evenly randomly distributed, but the output wasn't, then one could stop there.

    I think you are concerned that in taking ~10^6 strings (500 times) I'm not likely to find difficult cases in the ~10^24 space. My primitive statistics suggests that it's the sample size, not the population size that matters (for large enough of both). 1/sqrt(10^6) is 0.1% -- suggesting 99.9% confidence ! (It really isn't intuitive that the population size is not taken into account... happy to improve my understanding if I've got this wrong !)

    Put it another way. Suppose there is 1 difficult case in a million. That implies a (1 - 10^-6) chance of not spotting it with each sample, so after 10^6 samples there's a (1 - 10^-6)^(10^6), or about 37%, chance of missing it. I ran 500 trials of 10^6 each. If there is 1 difficult case in 100,000, there's only a 0.005% chance that 10^6 samples miss it.
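    (Checking those figures is quick -- the chance of missing the difficult case is just (1 - p)**n; a throwaway sketch:)

    #!/usr/bin/perl
    use strict;
    use warnings;

    # Probability of never drawing a "difficult" case of frequency $p
    # in $n independent samples.
    for my $p ( 1e-6, 1e-5 ) {
        my $n = 1e6;
        printf "p = %g, n = %g: chance of missing it = %.4g%%\n",
               $p, $n, 100 * ( 1 - $p ) ** $n;
    }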

    On the other hand, that doesn't show that the hash function is good -- because to measure that requires some agreement about what sort of input is "typical". Or, perhaps, come up with something which is "worst case" -- so that doing a good job with that input would give (reasonable) assurance of good performance on "typical" input.

    Anyway, I've modified my testing, and tested as you suggested: strings of length 1..8 made up of characters '0'..'9', only. I ensured that only unique strings were hashed. I used FNV1a. For each run I fed the same 1,048,576 strings to CRC-32 and FNV1a. I also combined my testing of the full 32-bit hash with the (folded) 16-out-of-32-bit hash -- using the same strings, and ending up with 32,000 runs for the 16/32-bit hashing (where previously I had 1,000).

    With the new code I reran the original tests. The results are:

    1. using the full 32 bits of the hashes, 1,048,576 strings of type 'rand1' (8-12 bytes of 0x00..0xFF) -- see http://www.highwayman.com/perls/Hash-32a_1.pdf.

    2. folding the 32 bits of the hashes down to 16 bits by XORing MS and LS halves together -- see http://www.highwayman.com/perls/Hash-32a_2.pdf.

    3. using MS and LS halves of the CRC-32 separately -- see http://www.highwayman.com/perls/Hash-32a_3.pdf. Similarly the MS and LS halves of FNV1a -- see http://www.highwayman.com/perls/Hash-32a_4.pdf. (This is to see if either half exhibits any non-uniformity.)

    These are similar to the previous results, except that with the extra number of trials, the 16-bit hashes follow the expected curve more tightly ! Anyway, I don't think I broke it.

    So now, the results of the new tests, using strings '0'..'99999999' as you suggested:

    1. using the full 32 bits of the hashes, 1,048,576 strings of type 'rand2' (1-8 bytes of '0'..'9') -- see http://www.highwayman.com/perls/Hash-32a_5.pdf.

      The interesting thing here is that both FNV1a and CRC-32 "outperform" what you'd expect to see from a purely random distribution of hash values -- giving fewer collisions.

    2. folding the 32 bits of the hashes down to 16 bits by XORing MS and LS halves together -- see http://www.highwayman.com/perls/Hash-32a_6.pdf.

      Here the results are much closer to the expected random curve -- the FNV1a pretty much on it, the CRC-32 a bit to the right (slightly more collisions).

    3. using MS and LS halves of the CRC-32 separately -- see http://www.highwayman.com/perls/Hash-32a_7.pdf. Similarly the MS and LS halves of FNV1a -- see http://www.highwayman.com/perls/Hash-32a_8.pdf.

      Here we see the two halves of the CRC-32 are not quite the same... the MS is slightly better than random, the LS is slightly worse. For the FNV1a there is nothing to choose between them. This appears to support the suggestion that CRC-32 is not "mixing" its input as much as FNV1a.


    Conclusions so far:

    1. I haven't yet seen strong support for CRC-32 not being an effective hash, at least when using either the entire 32 bits, or XORing down to 16 bits, or just using the MS 16 bits.

    2. I have seen a glimpse of CRC-32 not spreading the input bits around, so I would be leery of using it except as above.

    3. if anyone has any good ideas about how to construct a "typical, general purpose" input data set, or a "worst case" data set, I should be delighted to try it !

    4. I'll have a go at Jenkins' lookup3 algorithm and study the article. (Running light... ah me. Those were the days.)

    5. this gives all the appearance of a black-art !

      I'm convinced by the arguments put forward by the apparent current experts in the field. I find the fact that (for example) Bob Jenkins is submitting his entry Maraca to the NIST SHA-3 hash competition a convincing argument for his expertise. And I find Bret Mulvay's descriptions of hash function testing strategies clear and intuitive, and the fact that they are referenced by many other apparent experts persuasive.

      I'm convinced that, as the input domain is infinite, brute force testing is impossible, and that any testing performed by sampling the input domain is inadequate.

      It is inadequate because it only tests the fairness of the distribution of outputs within a single, fixed N-bit output domain, whereas for general purpose hashing the function must produce fair distributions for all table sizes of 2^n, where n ranges from some small value, say 3, up to and including N. Because in general purpose hashing, hash buckets are allocated and reallocated dynamically as the number of keys grows, there is an inherent and unavoidable need to rely upon binary truncation of the output domain to fairly map the same N-bit hash value into different 2^(n<=N) table sizes.
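      (Concretely, the "binary truncation" I mean is just masking off the low n bits of the hash to index a table of 2^n buckets -- a trivial sketch, with the hash value standing in for whichever hash function is in use:)

      # Map an N-bit hash value into a table of 2**$n buckets by simple
      # binary truncation (masking off the low $n bits).  As the table
      # grows, $n increases and the same stored hash is re-truncated.
      sub bucket_index {
          my( $hash, $n ) = @_;            # $n = log2 of the table size
          return $hash & ( ( 1 << $n ) - 1 );
      }

      my $hash = 0xDEADBEEF;               # e.g. the 32-bit hash of some key
      for my $n ( 3, 8, 16 ) {
          printf "table of 2^%-2d buckets: index = %d\n", $n, bucket_index( $hash, $n );
      }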

      And I'm persuaded by the arguments (the avalanche and funnelling effects) that the only way this will happen is if each bit of the input affects (on average) 50% of the bits in the N-bit output. I find the arguments for these effects persuasive and the tests for them understandable. And as the output domain is finite and routinely subdividable, I find the argument for testing strategies that test the effects of bit-wise changes to the inputs on the outputs a more sensible and statistically valid approach.
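      (The basic bit-wise test is easy to sketch, even if interpreting the results properly is where the subtlety lies: flip each input bit in turn and count how many output bits change; ideally the average is 50%. A rough illustration, using a textbook 32-bit FNV-1a as the function under test and assuming a 64-bit perl:)

      #!/usr/bin/perl
      use strict;
      use warnings;

      # Textbook 32-bit FNV-1a, used here only as the function under test.
      sub fnv1a_32 {
          my $h = 0x811C9DC5;
          for my $byte ( unpack 'C*', $_[0] ) {
              $h ^= $byte;
              $h = ( $h * 0x01000193 ) & 0xFFFFFFFF;
          }
          return $h;
      }

      # Crude avalanche measurement: for a sample of random 8-byte keys,
      # flip each input bit in turn and count how many of the 32 output
      # bits change.  Ideally the average is close to 16 (50%).
      my( $flips, $changed ) = ( 0, 0 );
      for ( 1 .. 1000 ) {
          my $key  = join '', map { chr int rand 256 } 1 .. 8;
          my $base = fnv1a_32( $key );
          for my $bit ( 0 .. 8 * length( $key ) - 1 ) {
              my $mutant = $key;
              vec( $mutant, $bit, 1 ) = vec( $mutant, $bit, 1 ) ^ 1;  # flip one input bit
              my $diff = $base ^ fnv1a_32( $mutant );
              $changed += unpack '%32b*', pack 'N', $diff;            # count changed output bits
              $flips++;
          }
      }
      printf "average output bits changed per input-bit flip: %.2f of 32\n",
             $changed / $flips;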

      Bottom line: I'm convinced, on the basis of the above and much else, that CRC32, which was never designed as a hashing function, is a poor choice when there are other functions that have been designed--and tested--specifically for this purpose. And I am convinced, on the basis of the effects that toggling input bits has on the output distribution frequencies, that XOR-folding as a mechanism for truncating N-bit hash values is bad practice.

      As such, I will not be using either for my hash function purposes.

      Good luck with your research.


      Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
      "Science is about questioning the status quo. Questioning authority".
      In the absence of evidence, opinion is indistinguishable from prejudice.

        OK, an n-bit hash function where some subset can be used to provide 3..n-bit hashes requires strong properties of "spreading the influence of the input"... otherwise the subsets won't work well.

        If that is your requirement, then CRC is indeed not for you.

        Testing the effectiveness of a hash is hard work, and as you say, limited by the ability to select test data. So I can see the attraction of testing some properties of the hash instead. Hopefully those properties are strongly correlated with effectiveness :-) (I suppose some confidence testing is required...)

        Thanks for the references. I will now "go away" and study :-)