why you don't use cyclic redundancy check algorithms for hashing purposes.

This surprised me. So I've been running a few tests. I've dug out my copy of Knuth, and refreshed my memory on how you calculate the probability of 'c' collisions if you throw 'm' balls at random into 'n' bins. Then I ran three tests.

Each test uses randomly constructed 8-12 byte strings, each byte being 0..255 -- the distribution of string lengths and byte values should be even, assuming my 64-bit random number generator is good (which I believe it is).

The tests were:

  1. take 1M (2^20) strings and count the number of collisions when using (a) CRC-32 and (b) FNV (see below). This was done 500 times, and the resulting distribution of numbers of collisions plotted against the expected distribution. See: Test-1.

  2. take 16,384 (2^14) strings and count the number of collisions when using (a) CRC-32 and (b) FNV, and XORing the MS and LS halves to give a 16-bit hash. This was done 1,000 times, and the resulting distribution of numbers of collisions plotted against the expected distribution. See: Test-2.

  3. as (2) but using (a) the LS and (b) the MS halves of CRC-32 to give a 16-bit hash. This was done 1,000 times. See: Test-3.

Visual inspection of these plots does not suggest that CRC-32 is clearly distinguishable from uniform randomness. The middle part of the FNV distributions is a little taller, suggesting it is a little less variable in it's results. The full 32-bit CRC-32 hash leans a little towards a few more collisions than expected, a little, perhaps...

I'm not sure whether to apply chi-squared or Kolmogorov-Smirnov to the data I have... but the plots don't suggest to me that CRC-32 is a bad hash.

If anyone can suggest a better test or a better statistical approach, or has contrary results, I'd love to hear !


The FNV hash is taken from http://bretm.home.comcast.net/~bretm/hash/6.html. What it comes down to is:

U32 FNV_Hash(U8 * p, int l) { U32 hash = 2166136261 ; while (l--) { hash = (hash * 16777619) ^ *p++ ; } ; return hash ; } ;


In reply to Re^5: 64-bit digest algorithms by gone2015
in thread 64-bit digest algorithms by BrowserUk

Title:
Use:  <p> text here (a paragraph) </p>
and:  <code> code here </code>
to format your post, it's "PerlMonks-approved HTML":



  • Posts are HTML formatted. Put <p> </p> tags around your paragraphs. Put <code> </code> tags around your code and data!
  • Titles consisting of a single word are discouraged, and in most cases are disallowed outright.
  • Read Where should I post X? if you're not absolutely sure you're posting in the right place.
  • Please read these before you post! —
  • Posts may use any of the Perl Monks Approved HTML tags:
    a, abbr, b, big, blockquote, br, caption, center, col, colgroup, dd, del, details, div, dl, dt, em, font, h1, h2, h3, h4, h5, h6, hr, i, ins, li, ol, p, pre, readmore, small, span, spoiler, strike, strong, sub, summary, sup, table, tbody, td, tfoot, th, thead, tr, tt, u, ul, wbr
  • You may need to use entities for some characters, as follows. (Exception: Within code tags, you can put the characters literally.)
            For:     Use:
    & &amp;
    < &lt;
    > &gt;
    [ &#91;
    ] &#93;
  • Link using PerlMonks shortcuts! What shortcuts can I use for linking?
  • See Writeup Formatting Tips and other pages linked from there for more info.