First: I am genuinely disappointed to see that your reference found CRC-32 such a bad hash -- apart from anything else, I've used it with (I believe!) some success. So I'm intrigued, and I'm looking at this in the interests of learning something. Learn something new every day...
Second: in any case, amongst my failings is a refusal to accept an explanation I don't understand. So I'm trying to understand.
What I've tried, so far, is to step back from ideas of "avalanche" and "funnelling" and other stuff whose significance and effect I don't (yet) see how to quantify, and concentrate simply on measuring collisions. Simple-minded, but direct, I think.
Now, it may be that there exists a perfect 32-bit hash function that somehow stirs up its inputs so that all parts of the input value affect all parts of the hash -- so that one can then use some subset, or some "folding", of the hash to provide a shorter hash result. My gut feeling is that this is a very strong requirement, and requires a complex hash function.
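(For concreteness, this is the sort of thing I mean by a subset or a "folding" of a 32-bit hash -- an illustrative fragment only, not anything from the test code:)

    my $h32    = 0xDEADBEEF;                       # stand-in for a 32-bit hash value
    my $sub16  = $h32 & 0xFFFF;                    # subset: just the LS 16 bits
    my $fold16 = ($h32 >> 16) ^ ($h32 & 0xFFFF);   # fold: XOR MS and LS halves together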
Anyway, I see you were not at all impressed by my initial testing against randomly constructed 8-12 byte strings of bytes 0x00..0xFF.
The object of the exercise is to see if the hash function produces an even distribution (or better) of hash values, for the sorts of inputs we expect. It seemed to me that by starting with this random input I would, at least, discover if CRC-32 (or FNV) was bad -- if the input to the hash was uniformly distributed but the output wasn't, then one could stop there.
I think you are concerned that in taking ~10^6 strings (500 times) I'm not likely to find difficult cases in the ~10^24 space. My primitive statistics suggest that it's the sample size, not the population size, that matters (for large enough of both): the sampling error goes as 1/sqrt(N), and 1/sqrt(10^6) is 0.1% -- suggesting the results are good to about 0.1%. (It really isn't intuitive that the population size is not taken into account... happy to improve my understanding if I've got this wrong!)
Put it another way: suppose there is 1 difficult case in a million. That implies a (1 - 10^-6) chance of not spotting it in any one sample, so in 10^6 samples there's a 37% chance of missing it altogether. I ran 500 trials of 10^6 each. And if there is 1 difficult case in 100,000, then a single trial of 10^6 samples has only a 0.005% chance of missing it.
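To make that arithmetic concrete, here's the back-of-the-envelope calculation in code -- just recomputing the numbers quoted above, nothing new:

    #!/usr/bin/perl
    use strict;
    use warnings;

    # chance of never drawing a "difficult" case of frequency $p in $n samples
    sub p_miss {
        my ($p, $n) = @_;
        return (1 - $p) ** $n;
    }

    printf "1 in 10^6, 10^6 samples: %.1f%% chance of missing it\n",
        100 * p_miss(1e-6, 1e6);    # ~36.8%, i.e. about 1/e
    printf "1 in 10^5, 10^6 samples: %.4f%% chance of missing it\n",
        100 * p_miss(1e-5, 1e6);    # ~0.0045%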
On the other hand, that doesn't show that the hash function is good -- because measuring that requires some agreement about what sort of input is "typical". Or, perhaps, one could come up with something which is "worst case" -- so that doing a good job on that input would give (reasonable) assurance of good performance on "typical" input.
Anyway, I've modified my testing, and tested as you suggested: strings of length 1..8 made up of characters '0'..'9' only. I ensured that only unique strings were hashed, and for each run I fed the same 1,048,576 strings to both CRC-32 and FNV1a. I also combined my testing of the full 32-bit hash with the (folded) 16-out-of-32-bit hash -- using the same strings, and ending up with 32,000 runs for the 16/32-bit hashing (where previously I had 1,000).
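The shape of the test is roughly as follows. This is only a sketch of what's described above, not the actual test script; it assumes a 64-bit perl (so the FNV-1a multiply doesn't overflow) and uses String::CRC32 for the CRC (Digest::CRC would do equally well):

    #!/usr/bin/perl
    use strict;
    use warnings;
    use String::CRC32;                 # exports crc32()

    # 32-bit FNV-1a
    sub fnv1a_32 {
        my $hash = 0x811C9DC5;                            # offset basis
        for my $byte (unpack 'C*', $_[0]) {
            $hash ^= $byte;
            $hash  = ($hash * 0x01000193) & 0xFFFFFFFF;   # FNV prime, keep 32 bits
        }
        return $hash;
    }

    # 1,048,576 unique strings, each 1..8 characters of '0'..'9'
    my %seen;
    $seen{ int rand 100_000_000 } = 1 until keys(%seen) >= 1_048_576;

    # count how many distinct hash values each hash produces
    my %hits;                          # $hits{$name}{$value}++
    for my $str (keys %seen) {
        my %h32 = ( 'CRC-32' => crc32($str), 'FNV1a' => fnv1a_32($str) );
        while (my ($name, $h) = each %h32) {
            $hits{"$name/32"}{$h}++;
            $hits{"$name/16"}{ ($h >> 16) ^ ($h & 0xFFFF) }++;   # XOR-fold to 16 bits
        }
    }

    my $n = keys %seen;
    for my $name (sort keys %hits) {
        my $distinct = keys %{ $hits{$name} };
        printf "%-10s  %d keys -> %d distinct values (%d collisions)\n",
            $name, $n, $distinct, $n - $distinct;
    }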
With the new code I reran the original tests. The results are:
- using the full 32 bits of the hashes, 1,048,576 strings of type 'rand1' (8-12 bytes of 0x00..0xFF) -- see http://www.highwayman.com/perls/Hash-32a_1.pdf.
- folding the 32 bits of the hashes down to 16 bits by XORing MS and LS halves together -- see http://www.highwayman.com/perls/Hash-32a_2.pdf.
- using MS and LS halves of the CRC-32 separately -- see http://www.highwayman.com/perls/Hash-32a_3.pdf. Similarly the MS and LS halves of FNV1a -- see http://www.highwayman.com/perls/Hash-32a_4.pdf. (This is to see if either half exhibits any non-uniformity.)
These are similar to the previous results, except that with the larger number of trials the 16-bit hashes follow the expected curve more tightly! Anyway, I don't think I broke it.
So now, the results of the new tests, using strings '0'..'99999999' as you suggested:
- using the full 32 bits of the hashes, 1,048,576 strings of type 'rand2' (1-8 bytes of '0'..'9') -- see http://www.highwayman.com/perls/Hash-32a_5.pdf.
  The interesting thing here is that both FNV1a and CRC-32 "outperform" what you'd expect to see from a purely random distribution of hash values -- giving fewer collisions. (What a purely random distribution would give can be worked out directly -- see the sketch after these results.)
- folding the 32 bits of the hashes down to 16 bits by XORing MS and LS halves together -- see http://www.highwayman.com/perls/Hash-32a_6.pdf.
  Here the results are much closer to the expected random curve -- the FNV1a pretty much on it, the CRC-32 a bit to the right (slightly more collisions).
- using MS and LS halves of the CRC-32 separately -- see http://www.highwayman.com/perls/Hash-32a_7.pdf. Similarly the MS and LS halves of FNV1a -- see http://www.highwayman.com/perls/Hash-32a_8.pdf.
  Here we see the two halves of the CRC-32 are not quite the same... the MS is slightly better than random, the LS is slightly worse. For the FNV1a there is nothing to choose between them. This appears to support the suggestion that CRC-32 is not "mixing" its input as much as FNV1a.
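For reference, the "expected random curve" can be computed directly: if n distinct keys are hashed uniformly at random into m slots, the expected number of distinct hash values is m * (1 - (1 - 1/m)^n), so the expected number of colliding keys is n minus that. A quick sketch (again, not part of the test code) for n = 1,048,576:

    #!/usr/bin/perl
    use strict;
    use warnings;

    # expected number of colliding keys when $n distinct keys are hashed
    # uniformly at random into $m slots
    sub expected_collisions {
        my ($n, $m) = @_;
        my $distinct = $m * (1 - (1 - 1/$m) ** $n);   # expected occupied slots
        return $n - $distinct;
    }

    my $n = 1_048_576;
    printf "32-bit hash: ~%.0f collisions expected\n",
        expected_collisions($n, 2**32);               # ~128
    printf "16-bit hash: ~%.0f collisions expected\n",
        expected_collisions($n, 2**16);               # ~983,040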
Conclusions so far:
- I haven't yet seen strong support for CRC-32 not being an effective hash, at least when using either the entire 32 bits, or XORing down to 16 bits, or just using the MS 16 bits.
- I have seen a glimpse of CRC-32 not spreading the input bits around, so I would be leery of using it except as above.
- if anyone has any good ideas about how to construct a "typical, general purpose" input data set, or a "worst case" data set, I should be delighted to try it!
- I'll have a go at Jenkins' lookup3 algorithm and study the article. (Running light... ah me. Those were the days.)
- this gives all the appearance of a black art!