Re: Digest (checksum) Algorithm for 12 Decimal Digits?
by syphilis (Archbishop) on Oct 05, 2007 at 16:19 UTC
|
Hi,
Pretty much as oha said. To state the obvious, there's less chance of a collision if you use (eg) twelve base 64 digits, than if you use twelve base 2 digits ... so, if you use Digest::MD5, get the 'md5_base64' and then truncate.
Oh ... wait ... you specified "decimal" digits. Any particular reason for that ?
Cheers, Rob | [reply] |
|
|
Rob-
The digits must decimal because some downstream code that is out of my control, requires it. :-(
I think I understand what oha is trying to tell me, but I still have a nagging doubt. Let's see if I can explain this doubt rationally, without having any evidence or support for it :-)
I like the "bucket" analogy, and I agree that a good algorithm should be able to evenly distribute among all the buckets available to it in equal density.
What I'm not so sure about is that if I remove some bits from the result, does that mean I've reduced the number of buckets and preserved the algorithms bucket-distribution-equality? For some reason I think this answer is no (remember - I have no evidence or facts to back me up here... it's just a feeling).
Also, does it matter which bits I dump? Most significant? Least significant? Random bunny-hop among all the bits?
Many thanks syphilis and oha!
-Craig
| [reply] |
|
|
does that mean I've ... preserved the algorithms bucket-distribution-equality?
Yes - so long as you're ignoring the same bit positions every time.
Let's assume you need to keep 4 digits of an 8 digit number - and you decide to keep the first 4 digits. That would be ok, so long as the number "123456" was saved as "12" (0012) and not "1234". To avoid that sort of trap, it's probably best to save the *trailing* digits, rather than the leading digits. Each 4-digit "abcd" will then represent 10000 different 8-digit numbers - and the distribution is unaffected.
Note that if, in the above example, "123456" was being saved as "1234", then "1234567" would also be saved as "1234". They've reduced to the same number - though they ought to have been reduced to different numbers ("12" and "123" respectively). I think that sort of treatment would introduce bias.
Cheers, Rob
| [reply] |
Re: Digest (checksum) Algorithm for 12 Decimal Digits?
by oha (Friar) on Oct 05, 2007 at 15:51 UTC
|
use a 16 bit algorithm, and use only 12 of them
Oha
update: sorry i re-read. use any algorithm which give you more information, and truncate this information to the data you need. | [reply] |
|
|
| [reply] |
|
|
suppose you have a 3 bit hash function: the best is if all the value populate the 8 possibilities in equal density. now you remove 1 bit. it means that you have only 4 buckets now, and being the result of the union of two who are equal density, they are equal density.
IOW, if the hash function you used was good, the result is good.
Oha
| [reply] |
Re: Digest (checksum) Algorithm for 12 Decimal Digits?
by salva (Canon) on Oct 05, 2007 at 21:24 UTC
|
You can fit up to 39 bits into 12 decimal digits.
Get the 39 lower bits for the MD5 of your data, convert them to a number and then to a string:
use Digest::MD5 qw(md5);
my $md5 = md5($data);
my $end = substr($md5, -5) & "\x7f\xff\xff\xff\xff";
my $acu = 0;
$acu = $acu * 256 + ord $_ for split(//, $end);
printf "%012.0f\n", $acu;
| [reply] [d/l] |
|
|
++ a good and easy example, and a good use of & in string context,
we must note that we have a bit more then half the possible results we could get with a 12 digit solution. (first digit will be 0..5)
also, we must note that we require 64 bit algebra.
iff we have 64 bit algebra, we could return a modulo, there is some more population on lower buckets but it's less then 0.1%, on the other hand we would have about the double of bucket, giving the OP less risks of clashing.
without 64 bit algebra, things get a little more complex:
my $md5 = ...;
my $lo = 0;
my $hi = 0;
for(split //, $md5)
{
$lo = $lo*256 + ord;
$hi = $hi*256 + $lo / 1_000_000;
$lo = $lo % 1_000_000;
$hi = $hi % 1_000_000;
}
return $hi.$lo;
| [reply] [d/l] |
|
|
my $md5 = ...;
my $dd = 0;
$dd = ($dd * 256 + ord) % 1e12
for split //, $md5;
print "$dd\n";
(actually, the biggest number that can appear in this algorithm is 2.56E+14, that can be represented in 48 bits) | [reply] [d/l] |
Re: Digest (checksum) Algorithm for 12 Decimal Digits?
by cmv (Chaplain) on Oct 08, 2007 at 15:23 UTC
|
You people are fantastic! Many thanks for all the help. Not only do you expertly answer the question, but you teach what is needed, to fill the gaps in my tiny little brain.
Specific thanks:
oha - Initial response, understandable "bucket" analogy, non-64-bit algebra example.
syphilis - oha support, great bit-position explanation.
salva - great code example, 64-bit algebra discussion.
You've helped out more than you'll know, and I bow to each of you.
A final bow to the perlmonks site in general, which has created a rose of a website in the abandoned-car-park of the internet.
-Craig | [reply] |