cmv has asked for the wisdom of the Perl Monks concerning the following question:

Folks-

I have space in an existing data structure, for 12 decimal digits, that I wish to use as a digest (checksum) to uniqely identify some data.

As I understand how digests work, the more digits you generate, the less likely it is to have 2 different pieces of data causing the exact same digest result to be generated.

If this is true, then I would like to use a digest algorithm that will generate exactly 12 decimal digits (zero padded). Does one exist? I can't seem to find one.

I would use Digest::CRC (16/32-bit), but they don't generate enough digits. The Digest::MD5 algorithms generate too many. Is there anything inbetween, or something that I can customize?

Thanks

-Craig

  • Comment on Digest (checksum) Algorithm for 12 Decimal Digits?

Replies are listed 'Best First'.
Re: Digest (checksum) Algorithm for 12 Decimal Digits?
by syphilis (Archbishop) on Oct 05, 2007 at 16:19 UTC
    Hi,

    Pretty much as oha said. To state the obvious, there's less chance of a collision if you use (eg) twelve base 64 digits, than if you use twelve base 2 digits ... so, if you use Digest::MD5, get the 'md5_base64' and then truncate.

    Oh ... wait ... you specified "decimal" digits. Any particular reason for that ?

    Cheers,
    Rob
      Rob-

      The digits must decimal because some downstream code that is out of my control, requires it. :-(

      I think I understand what oha is trying to tell me, but I still have a nagging doubt. Let's see if I can explain this doubt rationally, without having any evidence or support for it :-)

      I like the "bucket" analogy, and I agree that a good algorithm should be able to evenly distribute among all the buckets available to it in equal density.

      What I'm not so sure about is that if I remove some bits from the result, does that mean I've reduced the number of buckets and preserved the algorithms bucket-distribution-equality? For some reason I think this answer is no (remember - I have no evidence or facts to back me up here... it's just a feeling).

      Also, does it matter which bits I dump? Most significant? Least significant? Random bunny-hop among all the bits?

      Many thanks syphilis and oha!

      -Craig

        does that mean I've ... preserved the algorithms bucket-distribution-equality?

        Yes - so long as you're ignoring the same bit positions every time.

        Let's assume you need to keep 4 digits of an 8 digit number - and you decide to keep the first 4 digits. That would be ok, so long as the number "123456" was saved as "12" (0012) and not "1234". To avoid that sort of trap, it's probably best to save the *trailing* digits, rather than the leading digits. Each 4-digit "abcd" will then represent 10000 different 8-digit numbers - and the distribution is unaffected.

        Note that if, in the above example, "123456" was being saved as "1234", then "1234567" would also be saved as "1234". They've reduced to the same number - though they ought to have been reduced to different numbers ("12" and "123" respectively). I think that sort of treatment would introduce bias.

        Cheers,
        Rob
Re: Digest (checksum) Algorithm for 12 Decimal Digits?
by oha (Friar) on Oct 05, 2007 at 15:51 UTC
    use a 16 bit algorithm, and use only 12 of them

    Oha

    update: sorry i re-read. use any algorithm which give you more information, and truncate this information to the data you need.

      Doesn't it mean that if I truncate data, I'll be exposing myself to higher risk of having different data generating the exact same digest value?

      I believe that is true, but maybe the risk is still so small as to be negligable!?!

        suppose you have a 3 bit hash function: the best is if all the value populate the 8 possibilities in equal density. now you remove 1 bit. it means that you have only 4 buckets now, and being the result of the union of two who are equal density, they are equal density.
        IOW, if the hash function you used was good, the result is good.

        Oha

Re: Digest (checksum) Algorithm for 12 Decimal Digits?
by salva (Canon) on Oct 05, 2007 at 21:24 UTC
    You can fit up to 39 bits into 12 decimal digits.

    Get the 39 lower bits for the MD5 of your data, convert them to a number and then to a string:

    use Digest::MD5 qw(md5); my $md5 = md5($data); my $end = substr($md5, -5) & "\x7f\xff\xff\xff\xff"; my $acu = 0; $acu = $acu * 256 + ord $_ for split(//, $end); printf "%012.0f\n", $acu;
      ++ a good and easy example, and a good use of & in string context,

      we must note that we have a bit more then half the possible results we could get with a 12 digit solution. (first digit will be 0..5)
      also, we must note that we require 64 bit algebra.

      iff we have 64 bit algebra, we could return a modulo, there is some more population on lower buckets but it's less then 0.1%, on the other hand we would have about the double of bucket, giving the OP less risks of clashing.

      without 64 bit algebra, things get a little more complex:

      my $md5 = ...; my $lo = 0; my $hi = 0; for(split //, $md5) { $lo = $lo*256 + ord; $hi = $hi*256 + $lo / 1_000_000; $lo = $lo % 1_000_000; $hi = $hi % 1_000_000; } return $hi.$lo;
        64 bit algebra is not really required.

        The double floats used by perl internally can represent 53 bit integers on most CPU architectures, enough to store the 12 decimal digits:

        my $md5 = ...; my $dd = 0; $dd = ($dd * 256 + ord) % 1e12 for split //, $md5; print "$dd\n";
        (actually, the biggest number that can appear in this algorithm is 2.56E+14, that can be represented in 48 bits)
Re: Digest (checksum) Algorithm for 12 Decimal Digits?
by cmv (Chaplain) on Oct 08, 2007 at 15:23 UTC
    You people are fantastic! Many thanks for all the help. Not only do you expertly answer the question, but you teach what is needed, to fill the gaps in my tiny little brain.

    Specific thanks:
    oha - Initial response, understandable "bucket" analogy, non-64-bit algebra example.
    syphilis - oha support, great bit-position explanation.
    salva - great code example, 64-bit algebra discussion.

    You've helped out more than you'll know, and I bow to each of you.

    A final bow to the perlmonks site in general, which has created a rose of a website in the abandoned-car-park of the internet.

    -Craig