in reply to Re^5: 64-bit digest algorithms
in thread 64-bit digest algorithms

This is a follow-up/in addition to my previous reply. I finally got around to adapting some code I wrote to assess md5 to access CRC32, The result is this image

This image plots the state affect of changing bits in a randomly selected 128-bit input to CRC32 (plotted vertically), against

  1. the 32-bits in the output (plotted horizontally in the left 2/3rds of the image) and:
  2. the 16-bits resulting from XORing the upper and lower halves of the CRC32 together (plotted horizontally in the right third of the image).
  3. There is a "1-pixel" band of mid-grey surrounding the image, and vertically between the two sets of data.
  4. The image has been stretched x4 both horizontally and vertically to make discerning the individual pixels easier.
  5. The image was generated from accumulating counts of changes of pixels in the (two) outputs for each change of state of the 128-bits in the inputs, for each of 1 million 16-byte randomly chosen inputs.

    Although, the number of trials is in effect irrelevant as it doesn't vary (at all) from the state reached after just 1000 trials.

  6. The totals of the state changes accumulated, have had (slightly modified) chi-squared algorithm applied to them, to stretch the differences from the expected average.

    Though as will be explained, the results of this are minimal as the results are very "black & white" (a pun which will also be explained.)

So, for each randomly chosen input, the CRC is calculated. Then, each bit of the input in turn is toggled, the CRC32 recalculated, and the differences between the two CRCs is are counted and accumulated for the number of trials. And the chi-squared is applied to emphasis the differences from the expected averages.

Thus, for each pixel, a mid-grey color (matching the border, represents the expected 50% of the pixels change ideal. A pure black pixel indicates that that pixel never changed. A pure white pixel indicates that that pixel always changed. And, as I said above, the results are very black & white. And they indicate that CRC32 makes for a very poor hashing algorithm, because with some columns are entirely white regardless of the input, and other entirely black, it means that some hash values will be way over used and other never.

The mid-grey and black vertical strips on the right hand third of the image is the result of performing the same processing upon a 16-bit XOR-fold of the CRC32s produced above. The absence of white would seem to indicate that this has improved the result, but the presence of pure black just goes to show that you cannot generate bits using XOR, only dilute them. Upshot: XOR folding is bad for hashes. Truncation is okay because it uses the bits available without diluting them.

And that brings me back to your collisions graphs. I think you got your math wrong. I simply cannot believe that you should expect 1890 collisions from 16384 randomly chosen samples from a domain of 2**32 possibilities. Birthday paradox or not, that is way, way too high a collision rate. Way too high.

Hmm. I uploaded the image and the size displayed on the update page conformed it was seen, but for 20 minutes now, it fails to be rendered back to my browser. So you may not be able to see it.

However, if you download the code below, save it as CRCtest.pl and run it with the command line: perl -s CRC32test.pl -TRIALS=1e5 -MAP=BW, it will be reproduced for you locally, though you will probably need to arrange to load it into an image app yourself (if you use *nix), and then stretch it a little to see it clearly. Updated to pack the CRC. Image updated.

#! perl -slw use strict; use Math::Random::MT qw[ rand srand ]; use String::CRC32; use GD; $|++; my %map = ( 255 => sub{ return ( 0, 0, $_[0] * 255 ) }, 510 => sub{ return ( 0, $_[0]*255, 255 ) }, 765 => sub{ return ( 0, 255, (1-$_[0])*255 ) }, 1020 => sub{ return ( $_[0]*255, 255, 0 ) }, 1275 => sub{ return ( 255, (1-$_[0])*255, 0 ) }, 1530 => sub{ return ( 255, 0, $_[0]*255 ) }, 1785 => sub{ return ( 255, $_[0]*255, 255 ) }, ); my @map = sort{ $a <=> $b } keys %map; sub colorRamp1785 { my( $v, $vmin, $vmax ) = @_; $v = $vmax if $v > $vmax; $v = $vmin if $v < $vmin; $v = ( $v - $vmin ) / ( $vmax - $vmin ); ( $v * 1785 ) < $_ and return rgb2n( $map{ $_ }->( $v ) ) for @map +; } our $SRAND; srand( $SRAND ) if $SRAND; our $TRIALS ||= 1e3; my $HALF = $TRIALS / 2; our $RANGE ||= $TRIALS /4; our $MAP ||= 'C'; my $colorRamp = $MAP eq 'BW' ? \&colorRampBW : $MAP eq 'C' ? \&colorRamp1020 : $MAP eq 'CE' ? \&colorRamp1785 : die "-MAP= [BW|C|CE]\n"; my @crcChanges = map [ (0) x 32 ], 1 .. 128; my @xorChanges = map [ (0) x 16 ], 1 .. 128; for ( 0 .. $TRIALS ) { printf "\r$_\t" unless $_ % ($TRIALS/10); ## Pick a random message; my $message = pack 'V4', map rand( 2**32 ), 1 .. 4; ## Calculate reference CRC ## my $CRC = crc32( $message ); my $CRC = pack 'V', crc32( $message ); ## XOR-fold the reference my $XOR = substr( $CRC, 0, 4 ) ^ substr( $CRC, 4, 4 ); for my $bit ( 0 .. 127 ) { my $copy = $message; ## toggle this bit in the message vec( $copy, $bit, 1 ) ^= 1; ## Calculate new CRC ## my $newCRC = crc32( $copy ); my $newCRC = pack 'V', crc32( $copy ); ## Isolate the bits that changed my $changed = $CRC ^ $newCRC; ## Accumulate the changes vec( $changed, $_, 1 ) and ++$crcChanges[ $bit ][ $_ ] for 0 . +. 31; ## XOR-fold the new md5 my $newXOR = substr( $newCRC, 0, 4 ) ^ substr( $newCRC, 4, 4 ) +; ## Isolate the changes $changed = $XOR ^ $newXOR; ## And accumulate them vec( $changed, $_, 1 ) and ++$xorChanges[ $bit ][ $_ ] for 0 . +. 15; } } print "\n"; my $img = GD::Image->new( 51, 130, 1 ); $img->filledRectangle( 0, 0, 51, 130, rgb2n( (128) x 3 ) ); my @maxmins = ( 2**32, 0 )x 2; for my $inBit ( 0 .. 127 ) { for my $outBit ( 0 .. 31 ) { my $chi = $crcChanges[ $inBit ][ $outBit ] - $HALF; $chi = $chi**2 * ( $chi <=> 0 ); $chi /= $TRIALS; $maxmins[ 0 ] = $chi if $maxmins[ 0 ] > $chi; $maxmins[ 1 ] = $chi if $maxmins[ 1 ] < $chi; $img->setPixel( $outBit+1, $inBit+1, $colorRamp->( $chi, -$RANGE, $RANGE ) ); next if $outBit > 15; $chi = $xorChanges[ $inBit ][ $outBit ] - $HALF; $chi = $chi**2 * ( $chi <=> 0 ); $chi /= $TRIALS; $maxmins[ 2 ] = $chi if $maxmins[ 2 ] > $chi; $maxmins[ 3 ] = $chi if $maxmins[ 3 ] < $chi; $img->setPixel( $outBit+34, $inBit+1, $colorRamp->( $chi, -$RANGE, $RANGE ) ); } } print "$RANGE : @maxmins"; my $fname = "crc32Xor.$MAP.png"; open PNG, '>:raw', $fname or die $!; print PNG $img->png; close PNG; system 1, $fname; sub rgb2n{ unpack 'N', pack 'CCCC', 0, @_ } sub colorRampBW { my( $v, $vmin, $vmax ) = @_; $v = $vmax if $v > $vmax; $v = $vmin if $v < $vmin; $v -= $vmin; $v /= $vmax - $vmin; $v *= 255; return rgb2n( ( $v ) x 3 ); } sub colorRamp1020 { my( $v, $vmin, $vmax ) = @_; my( $r, $g, $b ) = (1) x 3; $v = $vmax if $v > $vmax; $v = $vmin if $v < $vmin; ## $v = $vmax + $vmin - $v; my $dv = $vmax - $vmin; if( $v < ( $vmin + 0.25*$dv ) ) { $r = 0; $g = 4 * ($v - $vmin) / $dv; } elsif( $v < ( $vmin + 0.5 * $dv ) ) { $r = 0; $b = 1 + 4 * ($vmin + 0.25 * $dv - $v) / $dv; } elsif( $v < ( $vmin + 0.75 * $dv ) ) { $r = 4 * ($v - $vmin - 0.5 * $dv) / $dv; $b = 0; } else { $g = 1 + 4 * ($vmin + 0.75 * $dv - $v) / $dv; $b = 0; } return rgb2n( map int( $_ * 255), $r, $g, $b ); }

Note:There are various other command line arguments available, including color ramping. If you cannot work out how to use them, ask.


Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
"Too many [] have been sedated by an oppressive environment of political correctness and risk aversion."

Replies are listed 'Best First'.
Re^7: 64-bit digest algorithms
by gone2015 (Deacon) on Nov 22, 2008 at 03:05 UTC

    I'll read the rest again and try stuff in the morning... In the meantime:

    And that brings me back to your collisions graphs. I think you got your math wrong. I simply cannot believe that you should expect 1890 collisions from 16384 randomly chosen samples from a domain of 2**32 possibilities. Birthday paradox or not, that is way, way too high a collision rate. Way too high.

    The 1890 collisions is when mungeing the 32-bit hashes into 16-bit hashes -- so it's 16,364 random thingies in 65,536 bins... So roughly speaking:

    1. after 4,096 tosses ~1/16 of the bins are occupied...

    2. ...so the next 4,096 tosses ~4,096 * 1/16 = 256 collisions, and ~ 2/16 of the bins are occupied...

    3. ...so the next 4,096 tosses ~4,096 * 2/16 = 512 collisions, and ~ 3/16 of the bins are occupied...

    4. ...so the next 4,096 tosses ~4,096 * 3/16 = 768 collisions...

    ...total 1,536 -- accepting that this is an underestimate.