in reply to Re^2: 64-bit digest algorithms
in thread 64-bit digest algorithms
When I was playing around recently I used CRC32, taken once forwards across the input and a second time backwards across it. This certainly increased the effective space of the hash -- though my sums aren't up to asserting whether that gives a full 64 bits of "entropy".
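Something along these lines (a minimal sketch using String::CRC32, assuming it's installed; not my exact code):

```perl
use strict;
use warnings;
use String::CRC32;              # exports crc32(); assumes the module is installed

# Sketch of the scheme described above: CRC32 of the input taken forwards,
# and again over the reversed input, packed into a single 64-bit "digest".
sub crc32x2 {
    my ($data) = @_;
    my $fwd = crc32($data);
    my $bwd = crc32(scalar reverse $data);
    return pack 'NN', $fwd, $bwd;       # 8 bytes
}

printf "%s\n", unpack 'H*', crc32x2("the quick brown fox");
```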
There are also two other respectable 32-bit CRCs -- which would avoid the need to reverse the input.
There is also CRC64 itself, but a quick poke at CPAN doesn't reveal an implementation (except one buried in some "Bio..." stuff). It wouldn't be too difficult to bang out a table-driven CRC64, though.
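For instance, a sketch of what a table-driven CRC-64 might look like (using the ECMA-182 polynomial, MSB-first, no reflection; assumes a Perl built with 64-bit integers):

```perl
use strict;
use warnings;

# Table-driven CRC-64 sketch -- ECMA-182 polynomial, MSB-first, no
# reflection, zero initial value.  Assumes a 64-bit Perl build.
my $POLY = 0x42F0E1EBA9EA3693;

my @TABLE;
for my $i (0 .. 255) {
    my $crc = $i << 56;
    for (1 .. 8) {
        if ($crc & 0x8000000000000000) {
            $crc = (($crc & 0x7FFFFFFFFFFFFFFF) << 1) ^ $POLY;
        }
        else {
            $crc <<= 1;
        }
    }
    $TABLE[$i] = $crc;
}

sub crc64 {
    my ($data) = @_;
    my $crc = 0;
    for my $byte (unpack 'C*', $data) {
        my $index = (($crc >> 56) ^ $byte) & 0xFF;
        $crc = $TABLE[$index] ^ (($crc & 0x00FFFFFFFFFFFFFF) << 8);
    }
    return $crc;
}

printf "%016X\n", crc64("123456789");
```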
You don't want anything crypto-secure, so a CRC could well be sufficient, and a good implementation will run like stink compared to any of the crypto-digest stuff.
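As a rough illustration of the speed difference (a sketch only, assuming the modules are installed; relative numbers will vary by build and input size):

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);
use String::CRC32 ();
use Digest::MD5 ();
use Digest::SHA ();

# Rough speed comparison of a plain CRC-32 against a couple of
# crypto digests over the same 1KB buffer.
my $data = 'x' x 1024;

cmpthese( -3, {
    crc32  => sub { String::CRC32::crc32($data) },
    md5    => sub { Digest::MD5::md5($data) },
    sha256 => sub { Digest::SHA::sha256($data) },
} );
```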
Re^4: 64-bit digest algorithms
by massa (Hermit) on Nov 13, 2008 at 11:17 UTC
[]s, HTH, Massa (κς,πμ,πλ)
Re^4: 64-bit digest algorithms
by BrowserUk (Patriarch) on Nov 15, 2008 at 06:49 UTC
This (see the red on the graphic and the conclusions) is why you don't use cyclic redundancy check algorithms for hashing purposes.
by gone2015 (Deacon) on Nov 20, 2008 at 20:03 UTC
"why you don't use cyclic redundancy check algorithms for hashing purposes."

This surprised me. So I've been running a few tests.

I dug out my copy of Knuth and refreshed my memory on how you calculate the probability of 'c' collisions if you throw 'm' balls at random into 'n' bins. Then I ran three tests. Each test uses randomly constructed 8-12 byte strings, each byte being 0..255 -- the distribution of string lengths and byte values should be even, assuming my 64-bit random number generator is good (which I believe it is). The tests were:

Visual inspection of these plots does not suggest that CRC-32 is clearly distinguishable from uniform randomness. The middle part of the FNV distributions is a little taller, suggesting it is a little less variable in its results. The full 32-bit CRC-32 hash leans a little towards a few more collisions than expected -- a little, perhaps...

I'm not sure whether to apply chi-squared or Kolmogorov-Smirnov to the data I have, but the plots don't suggest to me that CRC-32 is a bad hash. If anyone can suggest a better test or a better statistical approach, or has contrary results, I'd love to hear!

The FNV hash is taken from http://bretm.home.comcast.net/~bretm/hash/6.html. What it comes down to is:
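In essence (this is a sketch of the standard 32-bit FNV-1 and FNV-1a as that page describes them, assuming a 64-bit Perl; not necessarily the exact code I ran):

```perl
use strict;
use warnings;

# Sketch of the standard 32-bit FNV hashes (offset basis 2166136261,
# prime 16777619).  FNV-1 multiplies then XORs; FNV-1a XORs then
# multiplies.  Assumes a 64-bit Perl so the multiply doesn't overflow
# before the mask.
use constant {
    FNV_OFFSET => 2166136261,
    FNV_PRIME  => 16777619,
};

sub fnv1_32 {
    my ($data) = @_;
    my $hash = FNV_OFFSET;
    for my $byte (unpack 'C*', $data) {
        $hash = ($hash * FNV_PRIME) & 0xFFFFFFFF;
        $hash ^= $byte;
    }
    return $hash;
}

sub fnv1a_32 {
    my ($data) = @_;
    my $hash = FNV_OFFSET;
    for my $byte (unpack 'C*', $data) {
        $hash ^= $byte;
        $hash = ($hash * FNV_PRIME) & 0xFFFFFFFF;
    }
    return $hash;
}

printf "FNV-1:  %08X\nFNV-1a: %08X\n", fnv1_32("hello"), fnv1a_32("hello");
```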
by BrowserUk (Patriarch) on Nov 21, 2008 at 16:23 UTC
This is a follow-up to, and in addition to, my previous reply. I finally got around to adapting some code I wrote to assess md5 so that it assesses CRC32. The result is this image.

The image plots the effect of toggling each bit of a randomly selected 128-bit input to CRC32 (plotted vertically) against each bit of the resulting CRC (plotted horizontally).

So, for each randomly chosen input, the CRC is calculated. Then each bit of the input in turn is toggled, the CRC32 is recalculated, and the differences between the two CRCs are counted and accumulated over the number of trials. Chi-squared is then applied to emphasise the differences from the expected averages. (A simplified sketch of this per-bit processing is shown below.)

Thus, for each pixel, a mid-grey color (matching the border) represents the ideal, where that output bit changes in 50% of the trials. A pure black pixel indicates that the bit never changed; a pure white pixel indicates that it always changed.

And, as I said above, the results are very black and white. They indicate that CRC32 makes for a very poor hashing algorithm: with some columns entirely white regardless of the input, and others entirely black, some hash values will be way over-used and others never used at all.

The mid-grey and black vertical strips on the right-hand third of the image are the result of performing the same processing upon a 16-bit XOR-fold of the CRC32s produced above. The absence of white would seem to indicate that this has improved the result, but the presence of pure black just goes to show that you cannot generate bits using XOR, only dilute them. Upshot: XOR folding is bad for hashes. Truncation is okay because it uses the bits available without diluting them.

And that brings me back to your collisions graphs. I think you got your math wrong. I simply cannot believe that you should expect 1890 collisions from 16384 randomly chosen samples from a domain of 2**32 possibilities. Birthday paradox or not, that is way, way too high a collision rate. Way too high.
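A simplified sketch of the per-bit processing just described (for illustration only; not the actual CRCtest.pl code referred to below):

```perl
use strict;
use warnings;
use String::CRC32 qw(crc32);    # assumes String::CRC32 is installed

# Simplified sketch of the avalanche test: take random 128-bit (16-byte)
# inputs, toggle each input bit in turn, and count how often each of the
# 32 output bits of the CRC changes.  Ideally every counter ends up near
# 50% of the trials.
my $TRIALS = 10_000;
my @flips;      # $flips[input_bit][output_bit] = change count

for (1 .. $TRIALS) {
    my $input = pack 'C16', map { int rand 256 } 1 .. 16;
    my $crc   = crc32($input);

    for my $inbit (0 .. 127) {
        my $mutated = $input;
        vec($mutated, $inbit, 1) ^= 1;          # toggle one input bit
        my $diff = $crc ^ crc32($mutated);
        for my $outbit (0 .. 31) {
            $flips[$inbit][$outbit]++ if $diff & (1 << $outbit);
        }
    }
}

# Report how far each output-bit column strays from the 50% ideal,
# averaged over all 128 input bits.
for my $outbit (0 .. 31) {
    my $total = 0;
    $total += $flips[$_][$outbit] // 0 for 0 .. 127;
    printf "output bit %2d changed %.1f%% of the time\n",
        $outbit, 100 * $total / ($TRIALS * 128);
}
```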
However, if you download the code below, save it as CRCtest.pl and run it with the command line: perl -s CRC32test.pl -TRIALS=1e5 -MAP=BW, it will be reproduced for you locally, though you will probably need to arrange to load it into an image app yourself (if you use *nix), and then stretch it a little to see it clearly.

Updated to pack the CRC. Image updated.
Note: There are various other command line arguments available, including color ramping. If you cannot work out how to use them, ask.

Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
by gone2015 (Deacon) on Nov 22, 2008 at 03:05 UTC
by BrowserUk (Patriarch) on Nov 20, 2008 at 23:28 UTC
Disclaimer 1: I've no idea who you are or what your expertise level is. You could even be Knuth himself -- though the fact that his 'reference' hashing algorithm turns out to have pretty abysmal performance shows that even the very best can make errors.

Disclaimer 2: I'm not a mathematician, which is why I seek out the most authoritative reference material I can find. And the references I gave are it.

That said, I'm going to challenge the effectiveness of your test methodology.

By feeding your tests with 8..12 character strings drawn from the full range of byte values (0..255), you are creating optimal conditions for the algorithms to produce optimal results. With randomly chosen bytes (0..255), you have ensured that the inputs to the algorithms have a statistically even chance of each bit (0..7) of each byte being set, with exactly even odds for the frequencies of all bits. I.e.
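Something along these lines (an illustrative sketch):

```perl
use strict;
use warnings;

# Illustrative sketch: with bytes drawn uniformly from 0..255, every one
# of bits 0..7 is set with probability 1/2, so the input already carries
# evenly distributed bits before any hashing is done.
my $N   = 1_000_000;
my @set = (0) x 8;

for (1 .. $N) {
    my $byte = int rand 256;
    for my $bit (0 .. 7) {
        $set[$bit]++ if $byte & (1 << $bit);
    }
}

printf "bit %d set in %.4f of bytes\n", $_, $set[$_] / $N for 0 .. 7;
```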
In fact, if you used 4-byte strings of bytes( rand( 0..255 ) ), no algorithm would be necessary, because just treating those 4 bytes as an integer forms a 32-bit "perfect hash". Increasing the length to 8 or 12 bytes does nothing to alter the probabilities; they remain even across the board, assuming infinite trials. Adding 9, 10 and 11-character strings of bytes( rand( 0 .. 255 ) ) skews the mix a little, but the skewing is probably lost in the noise.

With an average of 10-char strings of 0..255, the total population is 256^10 = 1208925819614629174706176, which means that even your first test (where the sample set was 2^20) is only sampling 8.6736173798840354720596224069595e-17% of the total population. That's just 0.000000000000000086%! I think if pollsters used such a small sample, their obligatory small-print caveat of +/-3% error would have to be more like +/-99.9% error. For the other two tests, where the sample size is just 16384, that drops to just 0.0000000000000000013%.

Even Monte Carlo simulations require a far higher sample size than this to achieve accuracy. For example, 10e6 samples of the Monte Carlo approximation of PI doesn't achieve any great accuracy:
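Something like the following (a sketch using Math::Random::MT, assuming it's installed; not the exact run behind those figures):

```perl
use strict;
use warnings;
use Math::Random::MT;       # assumes the module is installed

# Sketch of the Monte Carlo estimate of PI referred to above: throw
# random points at the unit square and count how many land inside the
# quarter circle of radius 1.
my $gen    = Math::Random::MT->new( 12345 );    # arbitrary seed
my $trials = 10_000_000;                        # "10e6" samples
my $inside = 0;

for (1 .. $trials) {
    my $x = $gen->rand();
    my $y = $gen->rand();
    $inside++ if $x * $x + $y * $y <= 1.0;
}

printf "estimate of PI after %d samples: %.8f\n",
    $trials, 4 * $inside / $trials;
```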
And that is a 5.4210108624275221700372640043497e-11% sample of the population size(*), which is several orders of magnitude higher than yours above.

*Using the 32-bit Math::Random::MT::rand() to pick points in a 1.0x1.0 square gives a finite range of possibilities: 2^32 x 2^32 = 2^64; 10e6 / 2^64 * 100 == 5.4e-11%

The problems with hashing algorithms producing high rates of collisions arise when their failure to produce good dispersal (due to funnelling) is exacerbated by input that does not provide a wide range of bits. I.e. when the strings being hashed consist entirely of some restricted subset of their full range of possibilities -- for example, when the keys consist entirely of upper-case letters, or just numeric digits.

Try performing your test 1 with strings that range from '0' .. '99999999' and see what results you get. And to achieve some level of statistical confidence in the results, sample a much higher proportion of the inputs. Using strings of just 0-8 digit characters, you should be able to exercise the entire population quite quickly.

You should also favour the FNV-1a algorithm over FNV-1, which is known to be less good with short inputs. (That just means doing the XOR before the multiply rather than after!) For completeness, add Jenkins' lookup3 algorithm to your tests.

Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
by gone2015 (Deacon) on Nov 22, 2008 at 02:48 UTC
by BrowserUk (Patriarch) on Nov 22, 2008 at 05:47 UTC