in reply to Re^3: 64-bit digest algorithms
in thread 64-bit digest algorithms

This (see the red on the graphic and the conclusions) is why you don't use cyclic redundancy check algorithms for hashing purposes.

Replies are listed 'Best First'.
Re^5: 64-bit digest algorithms
by gone2015 (Deacon) on Nov 20, 2008 at 20:03 UTC
    why you don't use cyclic redundancy check algorithms for hashing purposes.

    This surprised me. So I've been running a few tests. I've dug out my copy of Knuth, and refreshed my memory on how you calculate the probability of 'c' collisions if you throw 'm' balls at random into 'n' bins. Then I ran three tests.
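    (As a reminder of the expectation side of that calculation -- my own summary, not a quote from Knuth: throwing 'm' balls at random into 'n' bins, the expected number of occupied bins is n(1 - (1 - 1/n)^m), so the expected number of collisions is 'm' minus that. A tiny illustrative sketch, not the code used for the plots below:)

    # Expected collisions when m keys hash into n possible values,
    # assuming a uniformly random hash.
    sub expected_collisions {
        my ( $m, $n ) = @_;
        return $m - $n * ( 1 - ( 1 - 1 / $n ) ** $m );
    }

    printf "%.1f\n", expected_collisions( 2**20, 2**32 );   # ~128 expected for Test 1 below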

    Each test uses randomly constructed 8-12 byte strings, each byte being 0..255 -- the distribution of string lengths and byte values should be even, assuming my 64-bit random number generator is good (which I believe it is).

    The tests were:

    1. take 1M (2^20) strings and count the number of collisions when using (a) CRC-32 and (b) FNV (see below; a sketch of one such run follows this list). This was done 500 times, and the resulting distribution of numbers of collisions plotted against the expected distribution. See: Test-1.

    2. take 16,384 (2^14) strings and count the number of collisions when using (a) CRC-32 and (b) FNV, and XORing the MS and LS halves to give a 16-bit hash. This was done 1,000 times, and the resulting distribution of numbers of collisions plotted against the expected distribution. See: Test-2.

    3. as (2) but using (a) the LS and (b) the MS halves of CRC-32 to give a 16-bit hash. This was done 1,000 times. See: Test-3.
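    To make that concrete, here is a minimal sketch of a single collision-count run. This is a reconstruction, not the code that produced the plots; it assumes String::CRC32 for the CRC, and at this scale duplicate random input strings are negligible.

    use strict;
    use warnings;
    use String::CRC32;                        # provides crc32(); an assumption -- any CRC-32 implementation will do

    my $STRINGS    = 2**20;
    my %seen;
    my $collisions = 0;

    for ( 1 .. $STRINGS ) {
        my $len = 8 + int rand 5;                                # 8..12 bytes
        my $str = join '', map { chr int rand 256 } 1 .. $len;   # bytes 0x00..0xFF
        $collisions++ if $seen{ crc32( $str ) }++;               # same hash value seen before
    }

    print "collisions: $collisions\n";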

    Visual inspection of these plots does not suggest that CRC-32 is clearly distinguishable from uniform randomness. The middle part of the FNV distributions is a little taller, suggesting it is a little less variable in its results. The full 32-bit CRC-32 hash leans a little towards more collisions than expected -- a little, perhaps...

    I'm not sure whether to apply chi-squared or Kolmogorov-Smirnov to the data I have... but the plots don't suggest to me that CRC-32 is a bad hash.

    If anyone can suggest a better test or a better statistical approach, or has contrary results, I'd love to hear !


    The FNV hash is taken from http://bretm.home.comcast.net/~bretm/hash/6.html. What it comes down to is:

    U32 FNV_Hash(U8 *p, int l)
    {
        U32 hash = 2166136261 ;             /* 32-bit FNV offset basis */

        while (l--)
        {
            /* FNV-1: multiply by the 32-bit FNV prime, then XOR in the next byte */
            hash = (hash * 16777619) ^ *p++ ;
        } ;

        return hash ;
    } ;

      This is a follow-up to, and expansion of, my previous reply. I finally got around to adapting some code I wrote to assess md5 so that it assesses CRC32. The result is this image.

      This image plots the effect of changing bits in a randomly selected 128-bit input to CRC32 (plotted vertically), against:

      1. the 32-bits in the output (plotted horizontally in the left 2/3rds of the image) and:
      2. the 16-bits resulting from XORing the upper and lower halves of the CRC32 together (plotted horizontally in the right third of the image).
      3. There is a "1-pixel" band of mid-grey surrounding the image, and vertically between the two sets of data.
      4. The image has been stretched x4 both horizontally and vertically to make discerning the individual pixels easier.
      5. The image was generated by accumulating counts of bit changes in the (two) outputs, for each change of state of each of the 128 bits in the input, over 1 million randomly chosen 16-byte inputs.

        That said, the number of trials is in effect irrelevant, as the result doesn't vary (at all) from the state reached after just 1,000 trials.

      6. The accumulated totals of the state changes have had a (slightly modified) chi-squared calculation applied to them, to stretch the differences from the expected average.

        Though, as will be explained, the effect of this is minimal, because the results are very "black & white" (a pun which will also be explained).

      So, for each randomly chosen input, the CRC is calculated. Then each bit of the input in turn is toggled, the CRC32 is recalculated, and the differences between the two CRCs are counted and accumulated over the number of trials. Finally, the chi-squared is applied to emphasise the differences from the expected averages.
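      A minimal sketch of that toggle-and-compare loop might look like the following. This is a reconstruction, not the posted script; it assumes String::CRC32 for the CRC and omits the chi-squared scaling and the image generation.

      use strict;
      use warnings;
      use String::CRC32;                    # provides crc32(); an assumption on my part

      my $TRIALS = 1000;
      my @count;                            # $count[ input bit ][ output bit ] = flips seen

      for ( 1 .. $TRIALS ) {
          my $input = join '', map { chr int rand 256 } 1 .. 16;    # random 128-bit input
          my $base  = crc32( $input );

          for my $bit ( 0 .. 127 ) {
              my $mutated = $input;
              vec( $mutated, $bit, 1 ) ^= 1;                        # toggle one input bit
              my $diff = $base ^ crc32( $mutated );
              ( $diff >> $_ ) & 1 and $count[ $bit ][ $_ ]++ for 0 .. 31;
          }
      }

      # Ideally every count is close to $TRIALS / 2 (mid-grey); a count stuck at 0
      # (black) or at $TRIALS (white) is the sort of column discussed below.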

      Thus, for each pixel, a mid-grey color (matching the border) represents the ideal: the expected 50% rate of change for that output bit. A pure black pixel indicates that that bit never changed; a pure white pixel indicates that it always changed. And, as I said above, the results are very black & white. They indicate that CRC32 makes for a very poor hashing algorithm: with some columns entirely white regardless of the input, and others entirely black, some hash values will be way over-used and others never used.

      The mid-grey and black vertical strips in the right-hand third of the image are the result of performing the same processing upon a 16-bit XOR-fold of the CRC32s produced above. The absence of white would seem to indicate that this has improved the result, but the presence of pure black just goes to show that you cannot generate bits using XOR, only dilute them. Upshot: XOR folding is bad for hashes. Truncation is okay, because it uses the bits available without diluting them.
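      For clarity, the two 16-bit reductions being compared are simply these (illustrative only; $h32 stands for any 32-bit hash value):

      my $h32 = 0xDEADBEEF;                                 # stand-in for any 32-bit hash value
      my $folded    = ( $h32 >> 16 ) ^ ( $h32 & 0xFFFF );   # XOR-fold: each output bit mixes two hash bits
      my $truncated =   $h32 & 0xFFFF;                      # truncation: keep the low 16 bits as-is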

      And that brings me back to your collisions graphs. I think you got your math wrong. I simply cannot believe that you should expect 1890 collisions from 16384 randomly chosen samples from a domain of 2**32 possibilities. Birthday paradox or not, that is way, way too high a collision rate. Way too high.

      Hmm. I uploaded the image, and the size displayed on the update page confirmed it was received, but for 20 minutes now it has failed to be rendered back to my browser. So you may not be able to see it.

      However, if you download the code below, save it as CRC32test.pl and run it with the command line: perl -s CRC32test.pl -TRIALS=1e5 -MAP=BW, it will be reproduced for you locally, though you will probably need to load it into an image app yourself (if you use *nix) and then stretch it a little to see it clearly. Updated to pack the CRC. Image updated.

      Note: There are various other command line arguments available, including color ramping. If you cannot work out how to use them, ask.


      Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
      "Science is about questioning the status quo. Questioning authority".
      In the absence of evidence, opinion is indistinguishable from prejudice.

        I'll read the rest again and try stuff in the morning... In the meantime:

        And that brings me back to your collisions graphs. I think you got your math wrong. I simply cannot believe that you should expect 1890 collisions from 16384 randomly chosen samples from a domain of 2**32 possibilities. Birthday paradox or not, that is way, way too high a collision rate. Way too high.

        The 1,890 collisions figure is for when munging the 32-bit hashes into 16-bit hashes -- so it's 16,384 random thingies in 65,536 bins... So, roughly speaking:

        1. after 4,096 tosses ~1/16 of the bins are occupied...

        2. ...so the next 4,096 tosses give ~4,096 * 1/16 = 256 collisions, and ~2/16 of the bins are occupied...

        3. ...so the next 4,096 tosses give ~4,096 * 2/16 = 512 collisions, and ~3/16 of the bins are occupied...

        4. ...so the next 4,096 tosses give ~4,096 * 3/16 = 768 collisions...

        ...total 1,536 -- accepting that this is an underestimate.
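        For what it's worth, the exact expectation under the same balls-in-bins model (my arithmetic, using expected collisions = m - n(1 - (1 - 1/n)^m)) comes out close to the reported 1,890:

        perl -le 'my( $m, $n ) = ( 16384, 65536 ); printf "%.0f\n", $m - $n * ( 1 - ( 1 - 1/$n )**$m )'
        1885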

      Disclaimer 1: I've no idea who you are or what your expertise level is. You could even be Knuth himself, though the fact that his 'reference' hashing algorithm turns out to have pretty abysmal performance shows that even the very best can make errors.

      Disclaimer 2: I'm not a mathematician, which is why I seek out the most authoritative reference material I can find. And the references I gave are it.

      That said, I'm going to challenge the effectiveness of your test methodology.

      By feeding your tests with 8..12-character strings drawn from the full range of byte values (0..255), you are creating ideal conditions for the algorithms to produce optimal results.

      With randomly chosen bytes (0..255), you have ensured that every bit (0..7) of every input byte has an exactly even chance of being set -- the bit frequencies are statistically even across the board. Ie.

      for my $n ( 0 .. 255 ) {
          vec( chr( $n ), $_, 1 ) and $h{ $_ }++ for 0 .. 7;
      };;

      print "$_ : $h{ $_ }" for 0 .. 7;;

      0 : 128  1 : 128  2 : 128  3 : 128  4 : 128  5 : 128  6 : 128  7 : 128

      In fact, if you used 4-byte strings of bytes( rand( 0..255 ) ), no algorithm would be necessary, because just treating those 4 bytes as an integer forms a 32-bit "perfect hash". Increasing the length to 8 or 12 bytes does nothing to alter the probabilities; they remain even across the board, assuming infinite trials.
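      For instance (illustrative only):

      my $four_random_bytes = join '', map { chr int rand 256 } 1 .. 4;
      my $hash = unpack 'N', $four_random_bytes;   # the 4 bytes, read as a 32-bit integer, are the "hash"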

      Adding 9-, 10-, and 11-character strings of bytes( rand( 0 .. 255 ) ) skews the mix a little, but the skewing is probably lost in the noise. With an average of 10-character strings of 0..255, the total population is 256^10 = 1208925819614629174706176, which means that even your first test (where the sample set was 2^20) is only sampling 8.6736173798840354720596224069595e-17% of the total population. That's just 0.000000000000000086%!

      I think if pollsters used such a small sample, their obligatory small-print caveat of +/-3% error would have to be more like +/-99.9% error. For the other two tests, where the sample size is just 16,384, that drops to just 0.0000000000000000013%. Even Monte Carlo simulations require a far higher sample size than this to achieve accuracy.

      For example, 10e6 samples of the Monte Carlo approximation of PI don't achieve any great accuracy:

      $n = 0;
      rand()**2 + rand()**2 <= 1 and ++$n,
          printf "\r$_: %.10f", 4 * ( $n / $_ )
              for 1 .. 10e6;;

      10000000: 3.1419312000

      And that is a 5.4210108624275221700372640043497e-11% sample of the population size(*), which is several orders of magnitude higher than yours above.

      *Using the 32-bit Math::Random::MT::rand() to pick points in a 1.0x1.0 square gives a finite range of possibilities: 2^32 x 2^32 = 2^64; 10e6 / 2^64 * 100 == 5.4e-11%

      The problems with hashing algorithms producing high rates of collisions arise when their failure to produce good dispersal (due to funnelling) is exacerbated by input that does not exercise a wide range of bits. Ie. when the input strings consist entirely of some restricted subset of their full range of possibilities.

      For example, when the keys being hashed consist entirely of upper-case letters, or just numeric digits. Try performing your test 1 with strings that range from '0' .. '99999999' and see what results you get.

      And to achieve some level of statistical confidence in the results, sample a much higher proportion of the inputs. Using just 0-8 character strings of just digits, you should be able to exercise the entire population quite quickly.

      You should also favour the FNV-1a algorithm over FNV-1, which is known to be less good with short inputs. (That just means doing the XOR before the multiply rather than after!)
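      A minimal Perl rendering of that change (my sketch; the constants are the standard 32-bit FNV offset basis and prime, the same values as in the C quoted above):

      # FNV-1a: XOR the byte in first, then multiply. (FNV-1, as quoted above,
      # multiplies first and XORs second.)
      sub fnv1a_32 {
          my ( $data ) = @_;
          my $hash = 2166136261;
          for my $byte ( unpack 'C*', $data ) {
              $hash = ( ( $hash ^ $byte ) * 16777619 ) & 0xFFFFFFFF;
          }
          return $hash;
      }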

      For completeness, add Jenkins' lookup3 algorithm to your tests.


      Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
      "Science is about questioning the status quo. Questioning authority".
      In the absence of evidence, opinion is indistinguishable from prejudice.

        First: I am genuinely disappointed to see that your reference found CRC-32 such a bad hash -- apart from anything else, I've used it with (I believe !) some success. So, I'm intrigued and looking at this in the interests of learning something. Learn something new every day...

        Second: in any case, amongst my failings is a refusal to accept an explanation I don't understand. So I'm trying to understand.

        What I've tried, so far, is to step back from ideas of "avalanche" and "funnelling" and other stuff whose significance and effect I don't (yet) see how to quantify, and concentrate simply on measuring collisions. Simple-minded, but direct, I think.

        Now, it may be that there exists a perfect 32-bit hash function that somehow stirs up its inputs so that all parts of the input value affect all parts of the hash; one could then use some subset, or some "folding", of the hash to provide a shorter hash result. My gut feeling is that this is a very strong requirement, and requires a complex hash function.


        Anyway, I see you were not at all impressed by my initial testing against randomly constructed 8-12 byte strings of bytes 0x00..0xFF.

        The object of the exercise is to see if the hash function produces an even distribution (or better) of hash values, for the sorts of inputs we expect. It seemed to me that by starting with this random input I would, at least, discover if CRC-32 (or FNV) was bad -- if the input to the hash was evenly randomly distributed, but the output wasn't, then one could stop there.

        I think you are concerned that in taking ~10^6 strings (500 times) I'm not likely to find difficult cases in the ~10^24 space. My primitive statistics suggests that it's the sample size, not the population size that matters (for large enough of both). 1/sqrt(10^6) is 0.1% -- suggesting 99.9% confidence ! (It really isn't intuitive that the population size is not taken into account... happy to improve my understanding if I've got this wrong !)

        Put it another way. Suppose there is 1 difficult case in a million. That implies a (1 - 10^-6) chance of not spotting it with each sample, so in 10^6 samples there's a (1 - 10^-6)^10^6 ~ e^-1 ~ 37% chance of missing it. I ran 500 trials of 10^6 each. If there is 1 difficult case in 100,000, there's a (1 - 10^-5)^10^6 ~ e^-10 ~ 0.005% chance of missing it.

        On the other hand, that doesn't show that the hash function is good -- because to measure that requires some agreement about what sort of input is "typical". Or, perhaps, come up with something which is "worst case" -- so that doing a good job with that input would give (reasonable) assurance of good performance on "typical" input.

        Anyway, I've modified my testing, and tested as you suggested: strings of length 1..8 made up of characters '0'..'9', only. I ensured that only unique strings were hashed. I used FNV1a. For each run I fed the same 1,048,576 strings to CRC-32 and FNV1a. I also combined my testing of the full 32-bit hash with the (folded) 16-out-of-32-bit hash -- using the same strings, and ending up with 32,000 runs for the 16/32-bit hashing (where previously I had 1,000).
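        The key generation was along these lines (a sketch of the approach, not the exact code):

        # Build 1,048,576 unique strings of 1..8 decimal digits.
        my ( %seen, @keys );
        while ( @keys < 1_048_576 ) {
            my $len = 1 + int rand 8;                           # 1..8 characters
            my $key = join '', map { int rand 10 } 1 .. $len;   # each character '0'..'9'
            push @keys, $key unless $seen{ $key }++;
        }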

        With the new code I reran the original tests. The results are:

        1. using the full 32 bits of the hashes, 1,048,576 strings of type 'rand1' (8-12 bytes of 0x00..0xFF) -- see http://www.highwayman.com/perls/Hash-32a_1.pdf.

        2. folding the 32 bits of the hashes down to 16 bits by XORing MS and LS halves together -- see http://www.highwayman.com/perls/Hash-32a_2.pdf.

        3. using MS and LS halves of the CRC-32 separately -- see http://www.highwayman.com/perls/Hash-32a_3.pdf. Similarly the MS and LS halves of FNV1a -- see http://www.highwayman.com/perls/Hash-32a_4.pdf. (This is to see if either half exhibits any non-uniformity.)

        These are similar to the previous results, except that with the extra number of trials, the 16-bit hashes follow the expected curve more tightly ! Anyway, I don't think I broke it.

        So now, the results of the new tests, using strings '0'..'99999999' as you suggested:

        1. using the full 32 bits of the hashes, 1,048,576 strings of type 'rand2' (1-8 bytes of '0'..'9') -- see http://www.highwayman.com/perls/Hash-32a_5.pdf.

          The interesting thing here is that both FNV1a and CRC-32 "outperform" what you'd expect to see from a purely random distribution of hash values -- giving fewer collisions.

        2. folding the 32 bits of the hashes down to 16 bits by XORing MS and LS halves together -- see http://www.highwayman.com/perls/Hash-32a_6.pdf.

          Here the results are much closer to the expected random curve -- the FNV1a pretty much on it, the CRC-32 a bit to the right (slightly more collisions).

        3. using MS and LS halves of the CRC-32 separately -- see http://www.highwayman.com/perls/Hash-32a_7.pdf. Similarly the MS and LS halves of FNV1a -- see http://www.highwayman.com/perls/Hash-32a_8.pdf.

          Here we see the two halves of the CRC-32 are not quite the same... the MS is slightly better than random, the LS is slightly worse. For the FNV1a there is nothing to choose between them. This appears to support the suggestion that CRC-32 is not "mixing" its input as much as FNV1a.


        Conclusions so far:

        1. I haven't yet seen strong support for CRC-32 not being an effective hash, at least when using either the entire 32-bits, or XORing down to 16-bits, or just using the MS 16-bits.

        2. I have seen a glimpse of CRC-32 not spreading the input bits around, so I would be leery of using it except as above.

        3. if anyone has any good ideas about how to construct a "typical, general purpose" input data set, or a "worst case" data set, I should be delighted to try it !

        4. I'll have a go at Jenkins' lookup3 algorithm and study the article. (Running light... ah me. Those were the days.)

        5. this gives all the appearance of a black-art !