in reply to Re: 64-bit digest algorithms
in thread 64-bit digest algorithms

The application has nothing to do with either cryptography or security. Essentially I want a hashing function with (far) less risk/frequency of collisions than a 32-bit hashing function, and far less space utilisation that a 128-bit digest.

Assuming linear distribution (and unlimited space), a 64-bit hash has only a 0.000000023% chance of collisions relative to a 32-bit hash. but uses half as much space as MD5.

I also considered 48-bit, but they're harder to calculate; and 53-bit (because of the possibility of using doubles for the calculations), but most hashing algorithm using shifting and that doesn't work with FP. And both are as rare as rocking horse doo-doo.


Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
"Too many [] have been sedated by an oppressive environment of political correctness and risk aversion."

Replies are listed 'Best First'.
Re^3: 64-bit digest algorithms
by dHarry (Abbot) on Nov 13, 2008 at 09:43 UTC

    A list of Hash functions. There area a few 64 bits alternatives. Maybe elf64?

    HTH

      Many thanks for this link.

      After browsing around and reading a lot of very interesting information, it led me to FNV-1 & FNV-1a which seem to have been explicitly designed for my purposes with particular attention paid to the problems of clustering and collisions.

      This seems to be my best choice.


      Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
      "Science is about questioning the status quo. Questioning authority".
      In the absence of evidence, opinion is indistinguishable from prejudice.

        Then again, maybe the FNV isn't such a good choice after all :( Though it only seems to produce the bad results when XOR-folding is used to truncate it, and I don't need that.

        At this point, I'm still looking.


        Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
        "Science is about questioning the status quo. Questioning authority".
        In the absence of evidence, opinion is indistinguishable from prejudice.
Re^3: 64-bit digest algorithms
by gone2015 (Deacon) on Nov 13, 2008 at 10:26 UTC

    When I was playing around recently I used CRC32, taken once forwards across the input and a second time backwards across it. This certainly increased the effective space of the hash -- my sums is not up to any assertion about whether that gives a full 64 bits of "entropy".

    There are also two other respectable 32 bit CRCs -- avoiding the need to reverse the input.

    There is also CRC64 itself, but a quick poke at CPAN doesn't reveal an implementation (except buried in some "Bio..." stuff). Wouldn't be too difficult to bang out a table driven CRC64.

    You don't want anything crypto-secure, so a CRC could well be sufficient, and a good implementation will run like stink compared to any of the crypto-digest stuff.

      There is an arbitrary-output-size hash function family called RadioGatún...
      []s, HTH, Massa (κς,πμ,πλ)
        why you don't use cyclic redundancy check algorithms for hashing purposes.

        This surprised me. So I've been running a few tests. I've dug out my copy of Knuth, and refreshed my memory on how you calculate the probability of 'c' collisions if you throw 'm' balls at random into 'n' bins. Then I ran three tests.

        Each test uses randomly constructed 8-12 byte strings, each byte being 0..255 -- the distribution of string lengths and byte values should be even, assuming my 64-bit random number generator is good (which I believe it is).

        The tests were:

        1. take 1M (2^20) strings and count the number of collisions when using (a) CRC-32 and (b) FNV (see below). This was done 500 times, and the resulting distribution of numbers of collisions plotted against the expected distribution. See: Test-1.

        2. take 16,384 (2^14) strings and count the number of collisions when using (a) CRC-32 and (b) FNV, and XORing the MS and LS halves to give a 16-bit hash. This was done 1,000 times, and the resulting distribution of numbers of collisions plotted against the expected distribution. See: Test-2.

        3. as (2) but using (a) the LS and (b) the MS halves of CRC-32 to give a 16-bit hash. This was done 1,000 times. See: Test-3.

        Visual inspection of these plots does not suggest that CRC-32 is clearly distinguishable from uniform randomness. The middle part of the FNV distributions is a little taller, suggesting it is a little less variable in it's results. The full 32-bit CRC-32 hash leans a little towards a few more collisions than expected, a little, perhaps...

        I'm not sure whether to apply chi-squared or Kolmogorov-Smirnov to the data I have... but the plots don't suggest to me that CRC-32 is a bad hash.

        If anyone can suggest a better test or a better statistical approach, or has contrary results, I'd love to hear !


        The FNV hash is taken from http://bretm.home.comcast.net/~bretm/hash/6.html. What it comes down to is:

        U32 FNV_Hash(U8 * p, int l) { U32 hash = 2166136261 ; while (l--) { hash = (hash * 16777619) ^ *p++ ; } ; return hash ; } ;

Re^3: 64-bit digest algorithms
by ikegami (Patriarch) on Nov 13, 2008 at 18:21 UTC
    Doubles have 52 bits. The 53rd bit is the implied leading "1".

    Update: As pointed out by oshalla, the implied bit does count. As pointed out by BrowserUk, I forgot about the sign bit. So there's 54 bits of precision for signed integers.

      No! Doubles have 64-bits.

      Of which:

      • 1 bit is used for the sign;
      • 11 bits are used for the exponent;
      • 52 bits are used for the mantissa;
      • 1 extra bit is implied in the mantissa by the specification;

      With the result being that a 64-bit IEEE-754 double precision floating point value can act as a 53-bit signed integer for some purposes.

      If you're going to try to be a pedant, rather than contribute something useful--at least be right.

        can act as a 53-bit signed integer for some purposes.

        To be even more pedantic... :-)

        ...that would be a 53 bit unsigned integer, or a 54-bit sign and magnitude signed integer (using one bit for the sign); equivalent to a 54-bit 1's complement signed integer.

        Since 1's complement machines are as rare as rocking horse fumet, one might worry about the -0x80....0 minimum value. Happily that is quite comfortably accommodated in floating point form.

        The -0 value is generally treated as zero, so that shouldn't intrude.

        So, for practical purposes a 64-bit IEEE-754 double precision floating point value can act as a 54-bit signed integer (with the usual 2's complement -ve values).

        So:

        for my $n (53..55) { my $max = 2**($n-1) - 1 ; my $min = -(2**($n-1)) ; printf "%4d: -%s..%s\n", $n, show($min), show($max) ; } ; sub show { my ($x) = @_ ; my $ms = int(abs($x)/2**32) ; my $ls = abs($x) - ($ms * 2**32) ; sprintf '0x%04X_%04X_%04X_%04X', ($ms >> 16), ($ms & 0xFFFF), ($ls >> 16), ($ls & 0xFFFF) ; } ;
        gives:
            53: -0x0010_0000_0000_0000..0x000F_FFFF_FFFF_FFFF
            54: -0x0020_0000_0000_0000..0x001F_FFFF_FFFF_FFFF
            55: -0x0040_0000_0000_0000..0x0040_0000_0000_0000
        
        showing that 53-bit (2's complement) signed integer is on the small side, 55-bit is to big, but 54-bits is just right.


        For extra credit, those with 64-bit integers can consider (a) why:

        my $z = 0x0FED_CBA9_8765_4321 ; my $d = 9 ; printf " a = %20s * %d + i, for i = 0..%d\n", "$z", $d, $d ; # 12: 12345678901234567890 : +2345 print " i: a / $d : Error\n" ; foreach my $i (0..$d) { my $a = ($z * $d) + $i ; my $q = $a / $d ; printf " %2d: %20s : %+5d\n", $i, "$q", $q - ($z + int($i/$d)) ; } ;
        gives:
         a =  1147797409030816545 * 9 + i, for i = 0..9
          i:                a / 9 : Error
          0:  1147797409030816545 :    +0
          1: 1.14779740903082e+18 :  +128
          2: 1.14779740903082e+18 :  +128
          3: 1.14779740903082e+18 :  +128
          4: 1.14779740903082e+18 :  +128
          5: 1.14779740903082e+18 :  +128
          6: 1.14779740903082e+18 :  +128
          7: 1.14779740903082e+18 :  +128
          8: 1.14779740903082e+18 :  +128
          9:  1147797409030816546 :    +0
        
        and (b) whether this could be fixed, and finally (c) whether round-to-even is a Good Thing, with particular reference to this case.

        Bug #53784 refers.