in reply to Re^2: [OT] The statistics of hashing.
in thread [OT] The statistics of hashing.
Thanks syphilis. Your calculations make sense to me. But I'm not sure that it gels with the actual data?
Assuming I've coded your formula correctly (maybe not!), then using 10 hashes & vectors, I get the odds of having seen a dup after 1e9 inserts as (1 - ((4294967295/4294967296)**1e9) ) **10 := 0.00000014949378123.
By that point I had actually seen 13 collisions:
printf "After %5.5s inserts the odds of no dups: %.17f\n", $_, ( 1 - ( 0.99999999976716935634613037109375 ** $_ ) ) )**10 for map "${_}e8", 1..40;; After 1e8 inserts the odds of no dups: 0.00000000000000004 After 2e8 inserts the odds of no dups: 0.00000000000003802 After 3e8 inserts the odds of no dups: 0.00000000000195354 After 4e8 inserts the odds of no dups: 0.00000000003092706 After 5e8 inserts the odds of no dups: 0.00000000025689926 After 6e8 inserts the odds of no dups: 0.00000000141936970 After 7e8 inserts the odds of no dups: 0.00000000591942653 After 8e8 inserts the odds of no dups: 0.00000002009607516 After 9e8 inserts the odds of no dups: 0.00000005831019113 After 10e8 inserts the odds of no dups: 0.00000014949378123 ** After 11e8 inserts the odds of no dups: 0.00000034677629037 After 12e8 inserts the odds of no dups: 0.00000074067890139 After 13e8 inserts the odds of no dups: 0.00000147618719432 After 14e8 inserts the odds of no dups: 0.00000277379244752 After 15e8 inserts the odds of no dups: 0.00000495435410008 After 16e8 inserts the odds of no dups: 0.00000846740103378 After 17e8 inserts the odds of no dups: 0.00001392227490012 After 18e8 inserts the odds of no dups: 0.00002212133818548 After 19e8 inserts the odds of no dups: 0.00003409433217130 After 20e8 inserts the odds of no dups: 0.00005113288027133 After 21e8 inserts the odds of no dups: 0.00007482409152212 After 22e8 inserts the odds of no dups: 0.00010708222525783 After 23e8 inserts the odds of no dups: 0.00015017742682200 After 24e8 inserts the odds of no dups: 0.00020676062948530 After 25e8 inserts the odds of no dups: 0.00027988383247111 After 26e8 inserts the odds of no dups: 0.00037301510162471 After 27e8 inserts the odds of no dups: 0.00049004779031911 After 28e8 inserts the odds of no dups: 0.00063530363659898 After 29e8 inserts the odds of no dups: 0.00081352955192231 After 30e8 inserts the odds of no dups: 0.00102988807161069 After 31e8 inserts the odds of no dups: 0.00128994158265037 After 32e8 inserts the odds of no dups: 0.00159963057715543 After 33e8 inserts the odds of no dups: 0.00196524629692540 After 34e8 inserts the odds of no dups: 0.00239339823431121 After 35e8 inserts the odds of no dups: 0.00289097703606607 After 36e8 inserts the odds of no dups: 0.00346511341973626 After 37e8 inserts the odds of no dups: 0.00412313375677405 After 38e8 inserts the odds of no dups: 0.00487251300375936 After 39e8 inserts the odds of no dups: 0.00572082567410912 After 40e8 inserts the odds of no dups: 0.00667569553892502
And looking at the figure for 4e9 := 0.00667569553892502, by which time the 10 vectors will be almost fully populated, it looks way too low to me?
I would have expected that calculation (for N=4e9) to have yielded odds of almost 1?
|
|---|
| Replies are listed 'Best First'. | |
|---|---|
|
Re^4: [OT] The statistics of hashing.
by syphilis (Archbishop) on Apr 01, 2012 at 19:33 UTC | |
by BrowserUk (Patriarch) on Apr 01, 2012 at 20:06 UTC | |
by syphilis (Archbishop) on Apr 02, 2012 at 04:02 UTC |