in reply to Re^2: 64-bit digest algorithms
in thread 64-bit digest algorithms
When I was playing around recently I used CRC32, taken once forwards across the input and a second time backwards across it. This certainly increased the effective space of the hash -- though my sums aren't up to asserting whether that gives a full 64 bits of "entropy".
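Something along these lines (a minimal sketch using String::CRC32, assuming it's installed; not my exact code):

```perl
use strict;
use warnings;
use String::CRC32;              # exports crc32(); assumes the module is installed

# Sketch of the scheme described above: CRC32 of the input taken forwards,
# and again over the reversed input, packed into a single 64-bit "digest".
sub crc32x2 {
    my ($data) = @_;
    my $fwd = crc32($data);
    my $bwd = crc32(scalar reverse $data);
    return pack 'NN', $fwd, $bwd;       # 8 bytes
}

printf "%s\n", unpack 'H*', crc32x2("the quick brown fox");
```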
There are also two other respectable 32-bit CRCs -- which would avoid the need to reverse the input.
There is also CRC64 itself, but a quick poke at CPAN doesn't reveal an implementation (except one buried in some "Bio..." stuff). It wouldn't be too difficult to bang out a table-driven CRC64, though.
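For instance, a sketch of what a table-driven CRC-64 might look like (using the ECMA-182 polynomial, MSB-first, no reflection; assumes a Perl built with 64-bit integers):

```perl
use strict;
use warnings;

# Table-driven CRC-64 sketch -- ECMA-182 polynomial, MSB-first, no
# reflection, zero initial value.  Assumes a 64-bit Perl build.
my $POLY = 0x42F0E1EBA9EA3693;

my @TABLE;
for my $i (0 .. 255) {
    my $crc = $i << 56;
    for (1 .. 8) {
        if ($crc & 0x8000000000000000) {
            $crc = (($crc & 0x7FFFFFFFFFFFFFFF) << 1) ^ $POLY;
        }
        else {
            $crc <<= 1;
        }
    }
    $TABLE[$i] = $crc;
}

sub crc64 {
    my ($data) = @_;
    my $crc = 0;
    for my $byte (unpack 'C*', $data) {
        my $index = (($crc >> 56) ^ $byte) & 0xFF;
        $crc = $TABLE[$index] ^ (($crc & 0x00FFFFFFFFFFFFFF) << 8);
    }
    return $crc;
}

printf "%016X\n", crc64("123456789");
```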
You don't want anything crypto-secure, so a CRC could well be sufficient, and a good implementation will run like stink compared to any of the crypto-digest stuff.
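As a rough illustration of the speed difference (a sketch only, assuming the modules are installed; relative numbers will vary by build and input size):

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);
use String::CRC32 ();
use Digest::MD5 ();
use Digest::SHA ();

# Rough speed comparison of a plain CRC-32 against a couple of
# crypto digests over the same 1KB buffer.
my $data = 'x' x 1024;

cmpthese( -3, {
    crc32  => sub { String::CRC32::crc32($data) },
    md5    => sub { Digest::MD5::md5($data) },
    sha256 => sub { Digest::SHA::sha256($data) },
} );
```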
Re^4: 64-bit digest algorithms
by massa (Hermit) on Nov 13, 2008 at 11:17 UTC
[]s, HTH, Massa (κς,πμ,πλ)
Re^4: 64-bit digest algorithms
by BrowserUk (Patriarch) on Nov 15, 2008 at 06:49 UTC
This (see the red on the graphic and the conclusions) is why you don't use cyclic redundancy check algorithms for hashing purposes.
by gone2015 (Deacon) on Nov 20, 2008 at 20:03 UTC
"why you don't use cyclic redundancy check algorithms for hashing purposes."

This surprised me. So I've been running a few tests.

I dug out my copy of Knuth and refreshed my memory on how you calculate the probability of 'c' collisions if you throw 'm' balls at random into 'n' bins. Then I ran three tests. Each test uses randomly constructed 8-12 byte strings, each byte being 0..255 -- the distribution of string lengths and byte values should be even, assuming my 64-bit random number generator is good (which I believe it is). The tests were:

Visual inspection of these plots does not suggest that CRC-32 is clearly distinguishable from uniform randomness. The middle part of the FNV distributions is a little taller, suggesting it is a little less variable in its results. The full 32-bit CRC-32 hash leans a little towards a few more collisions than expected -- a little, perhaps...

I'm not sure whether to apply chi-squared or Kolmogorov-Smirnov to the data I have, but the plots don't suggest to me that CRC-32 is a bad hash. If anyone can suggest a better test or a better statistical approach, or has contrary results, I'd love to hear!

The FNV hash is taken from http://bretm.home.comcast.net/~bretm/hash/6.html. What it comes down to is:
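In essence (this is a sketch of the standard 32-bit FNV-1 and FNV-1a as that page describes them, assuming a 64-bit Perl; not necessarily the exact code I ran):

```perl
use strict;
use warnings;

# Sketch of the standard 32-bit FNV hashes (offset basis 2166136261,
# prime 16777619).  FNV-1 multiplies then XORs; FNV-1a XORs then
# multiplies.  Assumes a 64-bit Perl so the multiply doesn't overflow
# before the mask.
use constant {
    FNV_OFFSET => 2166136261,
    FNV_PRIME  => 16777619,
};

sub fnv1_32 {
    my ($data) = @_;
    my $hash = FNV_OFFSET;
    for my $byte (unpack 'C*', $data) {
        $hash = ($hash * FNV_PRIME) & 0xFFFFFFFF;
        $hash ^= $byte;
    }
    return $hash;
}

sub fnv1a_32 {
    my ($data) = @_;
    my $hash = FNV_OFFSET;
    for my $byte (unpack 'C*', $data) {
        $hash ^= $byte;
        $hash = ($hash * FNV_PRIME) & 0xFFFFFFFF;
    }
    return $hash;
}

printf "FNV-1:  %08X\nFNV-1a: %08X\n", fnv1_32("hello"), fnv1a_32("hello");
```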
by BrowserUk (Patriarch) on Nov 21, 2008 at 16:23 UTC
This is a follow-up to, and in addition to, my previous reply. I finally got around to adapting some code I wrote to assess md5 so that it assesses CRC32. The result is this image.

The image plots the effect of toggling each bit of a randomly selected 128-bit input to CRC32 (plotted vertically) against each bit of the resulting CRC (plotted horizontally).

So, for each randomly chosen input, the CRC is calculated. Then each bit of the input in turn is toggled, the CRC32 is recalculated, and the differences between the two CRCs are counted and accumulated over the number of trials. Chi-squared is then applied to emphasise the differences from the expected averages. (A simplified sketch of this per-bit processing is shown below.)

Thus, for each pixel, a mid-grey color (matching the border) represents the ideal, where that output bit changes in 50% of the trials. A pure black pixel indicates that the bit never changed; a pure white pixel indicates that it always changed.

And, as I said above, the results are very black and white. They indicate that CRC32 makes for a very poor hashing algorithm: with some columns entirely white regardless of the input, and others entirely black, some hash values will be way over-used and others never used at all.

The mid-grey and black vertical strips on the right-hand third of the image are the result of performing the same processing upon a 16-bit XOR-fold of the CRC32s produced above. The absence of white would seem to indicate that this has improved the result, but the presence of pure black just goes to show that you cannot generate bits using XOR, only dilute them. Upshot: XOR folding is bad for hashes. Truncation is okay because it uses the bits available without diluting them.

And that brings me back to your collisions graphs. I think you got your math wrong. I simply cannot believe that you should expect 1890 collisions from 16384 randomly chosen samples from a domain of 2**32 possibilities. Birthday paradox or not, that is way, way too high a collision rate. Way too high.
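A simplified sketch of the per-bit processing just described (for illustration only; not the actual CRCtest.pl code referred to below):

```perl
use strict;
use warnings;
use String::CRC32 qw(crc32);    # assumes String::CRC32 is installed

# Simplified sketch of the avalanche test: take random 128-bit (16-byte)
# inputs, toggle each input bit in turn, and count how often each of the
# 32 output bits of the CRC changes.  Ideally every counter ends up near
# 50% of the trials.
my $TRIALS = 10_000;
my @flips;      # $flips[input_bit][output_bit] = change count

for (1 .. $TRIALS) {
    my $input = pack 'C16', map { int rand 256 } 1 .. 16;
    my $crc   = crc32($input);

    for my $inbit (0 .. 127) {
        my $mutated = $input;
        vec($mutated, $inbit, 1) ^= 1;          # toggle one input bit
        my $diff = $crc ^ crc32($mutated);
        for my $outbit (0 .. 31) {
            $flips[$inbit][$outbit]++ if $diff & (1 << $outbit);
        }
    }
}

# Report how far each output-bit column strays from the 50% ideal,
# averaged over all 128 input bits.
for my $outbit (0 .. 31) {
    my $total = 0;
    $total += $flips[$_][$outbit] // 0 for 0 .. 127;
    printf "output bit %2d changed %.1f%% of the time\n",
        $outbit, 100 * $total / ($TRIALS * 128);
}
```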
However, if you download the code below, save it as CRCtest.pl and run it with the command line: perl -s CRC32test.pl -TRIALS=1e5 -MAP=BW, it will be reproduced for you locally, though you will probably need to arrange to load it into an image app yourself (if you use *nix), and then stretch it a little to see it clearly.

Updated to pack the CRC. Image updated.
Note: There are various other command line arguments available, including color ramping. If you cannot work out how to use them, ask.

Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
by gone2015 (Deacon) on Nov 22, 2008 at 03:05 UTC
by BrowserUk (Patriarch) on Nov 20, 2008 at 23:28 UTC
Disclaimer 1: I've no idea who you are or what your expertise level is. You could even be Knuth himself -- though the fact that his 'reference' hashing algorithm turns out to have pretty abysmal performance shows that even the very best can make errors.

Disclaimer 2: I'm not a mathematician, which is why I seek out the most authoritative reference material I can find. And the references I gave are it.

That said, I'm going to challenge the effectiveness of your test methodology.

By feeding your tests with 8..12 character strings drawn from the full range of byte values (0..255), you are creating optimal conditions for the algorithms to produce optimal results. With randomly chosen bytes (0..255), you have ensured that the inputs to the algorithms have a statistically even chance of each bit (0..7) of each byte being set, with exactly even odds for the frequencies of all bits. I.e.
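Something along these lines (an illustrative sketch):

```perl
use strict;
use warnings;

# Illustrative sketch: with bytes drawn uniformly from 0..255, every one
# of bits 0..7 is set with probability 1/2, so the input already carries
# evenly distributed bits before any hashing is done.
my $N   = 1_000_000;
my @set = (0) x 8;

for (1 .. $N) {
    my $byte = int rand 256;
    for my $bit (0 .. 7) {
        $set[$bit]++ if $byte & (1 << $bit);
    }
}

printf "bit %d set in %.4f of bytes\n", $_, $set[$_] / $N for 0 .. 7;
```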
In fact, if you used 4-byte strings of bytes( rand( 0..255 ) ), no algorithm would be necessary, because just treating those 4 bytes as an integer forms a 32-bit "perfect hash". Increasing the length to 8 or 12 bytes does nothing to alter the probabilities; they remain even across the board, assuming infinite trials. Adding 9, 10 and 11-character strings of bytes( rand( 0 .. 255 ) ) skews the mix a little, but the skewing is probably lost in the noise.

With an average of 10-char strings of 0..255, the total population is 256^10 = 1208925819614629174706176, which means that even your first test (where the sample set was 2^20) is only sampling 8.6736173798840354720596224069595e-17% of the total population. That's just 0.000000000000000086%! I think if pollsters used such a small sample, their obligatory small-print caveat of +/-3% error would have to be more like +/-99.9% error. For the other two tests, where the sample size is just 16384, that drops to just 0.0000000000000000013%.

Even Monte Carlo simulations require a far higher sample size than this to achieve accuracy. For example, 10e6 samples of the Monte Carlo approximation of PI doesn't achieve any great accuracy:
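Something like the following (a sketch using Math::Random::MT, assuming it's installed; not the exact run behind those figures):

```perl
use strict;
use warnings;
use Math::Random::MT;       # assumes the module is installed

# Sketch of the Monte Carlo estimate of PI referred to above: throw
# random points at the unit square and count how many land inside the
# quarter circle of radius 1.
my $gen    = Math::Random::MT->new( 12345 );    # arbitrary seed
my $trials = 10_000_000;                        # "10e6" samples
my $inside = 0;

for (1 .. $trials) {
    my $x = $gen->rand();
    my $y = $gen->rand();
    $inside++ if $x * $x + $y * $y <= 1.0;
}

printf "estimate of PI after %d samples: %.8f\n",
    $trials, 4 * $inside / $trials;
```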
And that is a 5.4210108624275221700372640043497e-11% sample of the population size(*), which is several orders of magnitude higher than yours above.

*Using the 32-bit Math::Random::MT::rand() to pick points in a 1.0x1.0 square gives a finite range of possibilities: 2^32 x 2^32 = 2^64; 10e6 / 2^64 * 100 == 5.4e-11%

The problems with hashing algorithms producing high rates of collisions arise when their failure to produce good dispersal (due to funnelling) is exacerbated by input that does not provide a wide range of bits. I.e. when the strings being hashed consist entirely of some restricted subset of their full range of possibilities -- for example, when the keys consist entirely of upper-case letters, or just numeric digits.

Try performing your test 1 with strings that range from '0' .. '99999999' and see what results you get. And to achieve some level of statistical confidence in the results, sample a much higher proportion of the inputs. Using strings of just 0-8 digit characters, you should be able to exercise the entire population quite quickly.

You should also favour the FNV-1a algorithm over FNV-1, which is known to be less good with short inputs. (That just means doing the XOR before the multiply rather than after!) For completeness, add Jenkins' lookup3 algorithm to your tests.

Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
by gone2015 (Deacon) on Nov 22, 2008 at 02:48 UTC
by BrowserUk (Patriarch) on Nov 22, 2008 at 05:47 UTC