in reply to Digest (checksum) Algorithm for 12 Decimal Digits?

Hi,

Pretty much as oha said. To state the obvious, there's less chance of a collision if you use (eg) twelve base 64 digits, than if you use twelve base 2 digits ... so, if you use Digest::MD5, get the 'md5_base64' and then truncate.

Oh ... wait ... you specified "decimal" digits. Any particular reason for that ?

Cheers,
Rob
  • Comment on Re: Digest (checksum) Algorithm for 12 Decimal Digits?

Replies are listed 'Best First'.
Re^2: Digest (checksum) Algorithm for 12 Decimal Digits?
by cmv (Chaplain) on Oct 05, 2007 at 16:36 UTC
    Rob-

    The digits must decimal because some downstream code that is out of my control, requires it. :-(

    I think I understand what oha is trying to tell me, but I still have a nagging doubt. Let's see if I can explain this doubt rationally, without having any evidence or support for it :-)

    I like the "bucket" analogy, and I agree that a good algorithm should be able to evenly distribute among all the buckets available to it in equal density.

    What I'm not so sure about is that if I remove some bits from the result, does that mean I've reduced the number of buckets and preserved the algorithms bucket-distribution-equality? For some reason I think this answer is no (remember - I have no evidence or facts to back me up here... it's just a feeling).

    Also, does it matter which bits I dump? Most significant? Least significant? Random bunny-hop among all the bits?

    Many thanks syphilis and oha!

    -Craig

      does that mean I've ... preserved the algorithms bucket-distribution-equality?

      Yes - so long as you're ignoring the same bit positions every time.

      Let's assume you need to keep 4 digits of an 8 digit number - and you decide to keep the first 4 digits. That would be ok, so long as the number "123456" was saved as "12" (0012) and not "1234". To avoid that sort of trap, it's probably best to save the *trailing* digits, rather than the leading digits. Each 4-digit "abcd" will then represent 10000 different 8-digit numbers - and the distribution is unaffected.

      Note that if, in the above example, "123456" was being saved as "1234", then "1234567" would also be saved as "1234". They've reduced to the same number - though they ought to have been reduced to different numbers ("12" and "123" respectively). I think that sort of treatment would introduce bias.

      Cheers,
      Rob