in reply to Re: [OT] The statistics of hashing.
in thread [OT] The statistics of hashing.
OK, I think I'm done with this diversion. I've built a program that will compute the expected number of collisions with no looping. Example output:
```
$ ./ana_2.pl 16 4 65536
N=65536, V=4, X=65536 integral(65536)=139415.765849051, integral(0)=136533.333333333
Expected collisions: 2882.43251571807

$ ./ana_2.pl 16 4 16384
N=65536, V=4, X=16384 integral(16384)=136541.854969116, integral(0)=136533.333333333
Expected collisions: 8.52163578287582

$ ./ana_2.pl 14 3 16384
N=16384, V=3, X=16384 integral(16384)=31411.9141476821, integral(0)=30037.3333333333
Expected collisions: 1374.58081434877

$ ./ana_2.pl 16 10 32768
N=65536, V=10, X=32768 integral(32768)=191953.190301726, integral(0)=191952.863492063
Expected collisions: 0.326809662627056
```
The code is straightforward:
$ cat ana_2.pl

```perl
#!/usr/bin/perl
#
# ana_2.pl N V X
#
# N=vector size (bits), V=number of vectors, X=sample number
#
use strict;
use warnings;
use feature ':5.10';

my $n = shift;  $n = 1 << $n;
my $v = shift;
my $x = shift;

my ($exp1, $exp2, $exp3);
given ($v) {
    when ( 1) { $exp1 = ex_1($n, $x),  $exp2 = ex_1($n, 0) }
    when ( 2) { $exp1 = ex_2($n, $x),  $exp2 = ex_2($n, 0) }
    when ( 3) { $exp1 = ex_3($n, $x),  $exp2 = ex_3($n, 0) }
    when ( 4) { $exp1 = ex_4($n, $x),  $exp2 = ex_4($n, 0) }
    when (10) { $exp1 = ex_10($n, $x), $exp2 = ex_10($n, 0) }
    default   { die "Need symbolic integral form for $v vectors!\n"; }
}
$exp3 = $exp1 - $exp2;
print "N=$n, V=$v, X=$x integral($x)=$exp1, integral(0)=$exp2\n";
print "Expected collisions: $exp3\n";

sub ex_1 {
    my ($N, $X) = @_;
    return    $N   *exp(   -$X/$N) + $X;
}

sub ex_2 {
    my ($N, $X) = @_;
    return   -$N/2 *exp( -2*$X/$N)
          +  2*$N  *exp(   -$X/$N) + $X;
}

sub ex_3 {
    my ($N, $X) = @_;
    return    $N/3 *exp( -3*$X/$N)
          -  3*$N/2*exp( -2*$X/$N)
          +  3*$N  *exp(   -$X/$N) + $X;
}

sub ex_4 {
    my ($N, $X) = @_;
    return   -$N/4 *exp( -4*$X/$N)
          +  4*$N/3*exp( -3*$X/$N)
          -  3*$N  *exp( -2*$X/$N)
          +  4*$N  *exp(   -$X/$N) + $X;
}

sub ex_10 {
    my ($N, $X) = @_;
    return    -$N/10*exp(-10*$X/$N)
          +  10*$N/9 *exp( -9*$X/$N)
          -  45*$N/8 *exp( -8*$X/$N)
          + 120*$N/7 *exp( -7*$X/$N)
          -  35*$N   *exp( -6*$X/$N)
          + 252*$N/5 *exp( -5*$X/$N)
          - 105*$N/2 *exp( -4*$X/$N)
          +  40*$N   *exp( -3*$X/$N)
          -  45*$N/2 *exp( -2*$X/$N)
          +  10*$N   *exp( -1*$X/$N) + $X;
}
```
Notes:
...roboticus
When your only tool is a hammer, all problems look like your thumb.
Replies are listed 'Best First'.
Re^3: [OT] The statistics of hashing. (SOLVED)
by BrowserUk (Patriarch) on Apr 02, 2012 at 23:59 UTC
Thank you roboticus. I ran your algorithm against the results of my 60-hour run, and the correlation is simply stunning!
And that still includes the possibility -- though I believe it to be remote -- that there are 1 or more actual dups in there. My only regret now is that I wish I'd allowed the run to go to the 3/4 point instead of stopping it halfway. The number of false positives would still have been easily manageable for the second pass:
That is an order of magnitude less than my own crude attempt was predicting, hence why I stopped the run. The only way I can repay you is to assure you that I will do my very best to try and understand the formula -- which I do not currently. And of course, give you credit when the world comes knocking at my door for my eminently patentable -- at least if you take Apple as your guide -- algorithm :)

With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
by tye (Sage) on Apr 03, 2012 at 04:47 UTC
"I will do my very best to try and understand the formula"

Perhaps this will help some. Below is some fairly simple code that does the precise calculations of the odds for a collision in a single hash (and compares those calculations with the formula roboticus proposed). I came up with a simpler implementation than I expected to. I didn't even try to implement this at first -- only a little because I expected it to be more cumbersome, but mostly because I knew it would be impractical for computing odds for such a large number of insertions: it consumes O($inserts) for memory and O($inserts**2) for CPU.
$i tracks the number of insertions done so far. $odds[$s] represents the odds of there being $s+1 bits set in the hash (after $i insertions). $avg is an average of these values of $s (1..@odds) weighted by the odds. But, more importantly, it is also the odds of getting a single-hash collision when inserting (after $i insertions), except multiplied by $bits ($b). I multiply it and 1-exp(-$i/$b) by $b to normalize to the expected number of set bits instead of the odds of a collision, because humans have a much easier time identifying a number that is "close to 14" than a number that is "close to 14/2**32".

$odds[-1] turns out to exactly match (successively) the values from the birthday problem. For low numbers of $inserts, this swamps the calculation of $avg (the other terms just don't add up to a significant addition), which is part of why I was computing it for some values in my first reply. (Since you asked about that privately.)

I have yet to refresh my memory of the exact power series expansion of exp($x), so what follows is actually half guesses, but I'm pretty confident of them based on vague memory and observed behavior. For large $bits, 1-exp(-$inserts/$bits) ends up being close to 1/$bits because 1-exp(-$inserts/$bits) expands (well, "can be expanded") to a power series where 1/$bits is the first term and the next term depends on 1/$bits**2, which is so much smaller that it doesn't matter much (and nor do any of the subsequent terms, even when added together). On the other hand, for large values of $inserts, 1-exp(-$inserts/$bits) is close to $avg because the formula for $avg matches the first $inserts terms of the power series expansion.

I hope the simple code makes it easy for you to see how these calculations match the odds I described above. But don't hesitate to ask questions if the correspondence doesn't seem clear to you. Running the code shows how my calculations match roboticus' formula.
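The per-insertion update tye describes can be sketched like this (a reconstruction, not tye's actual code, which is not reproduced above; the names $b, @odds and $avg follow his description):

```perl
#!/usr/bin/perl
# Sketch of the exact single-hash calculation described above: track the
# full probability distribution of the number of set bits in a $b-bit
# vector after each random insertion, then compare the exact expected
# count of set bits with the closed-form approximation $b*(1-exp(-$i/$b)).
use strict;
use warnings;

my $b       = 2**10;   # vector size in bits (kept small: the loop is O(n**2))
my $inserts = 2**8;    # number of insertions

my @odds = ( 1 );      # $odds[$s] = P(exactly $s+1 bits set); 1 bit after 1 insert
for my $i ( 2 .. $inserts ) {
    my @next = ( 0 ) x ( @odds + 1 );
    for my $s ( 0 .. $#odds ) {
        my $set = $s + 1;                             # bits currently set
        $next[$s]   += $odds[$s] * $set / $b;         # insert hit an already-set bit
        $next[$s+1] += $odds[$s] * ( 1 - $set / $b ); # insert set a new bit
    }
    @odds = @next;
}

# Exact expected number of set bits...
my $avg = 0;
$avg += ( $_ + 1 ) * $odds[$_] for 0 .. $#odds;

# ...versus the exponential approximation used in roboticus' formula:
my $approx = $b * ( 1 - exp( -$inserts / $b ) );
printf "exact : %.6f\napprox: %.6f\n", $avg, $approx;
```

With these small numbers the two figures already agree to within a fraction of a bit; the gap shrinks further as $b grows.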
Looking up (one of) the power series expansions for computing exp($x) should match the values being computed for @odds, though there might be some manipulation required to make the match apparent (based on previous times I've done such work, decades ago).

- tye
by BrowserUk (Patriarch) on Apr 03, 2012 at 15:59 UTC
"Perhaps this will help some."

I seriously hope this will not offend you, but suspect it will. Simply put, your post does not help me at all. I am a programmer, not a mathematician, but given a formula, in a form I can understand(*), I am perfectly capable of implementing that formula in code. And perfectly capable of coding a few loops and print statements in order to investigate its properties.

What I have a problem with -- as evidently you do too -- is deriving those formulae. Like you (according to your own words above; there is nothing accusatory here), my knowledge of calculus is confined to the coursework I did at college some {mumble mumble} decades ago. Whilst I retain an understanding of the principles of integration, and recall some of its uses, the details are shrouded in a cloud of disuse. "Use it or lose it" is a very current, and very applicable, aphorism.

The direction my career has taken me means that I've had no more than a couple of occasions when calculus would have been useful. And on both those occasions, I succeeded in finding "a man that can", who could provide me with an understandable formula, and thus I achieved my goal without having to relive the history of mathematics.

(*) A big part of the problem is that mathematicians not only have a nomenclature -- which is necessary -- they also have 'historical conventions' -- which are not; and the latter are the absolute bane of the lay-person's life in trying to understand the mathematician's output. There you are, happily following along, when you reach a text that goes something like this:

    We may think intuitively of the Riemann sum: Ʃ_a^b f(x) dx

Where did H come from? Where did a disappear to? Is H (by convention) == to b - a? For the answer to this and other questions, tune in
by tye (Sage) on Apr 03, 2012 at 20:16 UTC
Re^3: [OT] The statistics of hashing.
by BrowserUk (Patriarch) on Apr 03, 2012 at 12:39 UTC
With the 'cancelling out' you've performed on the constants in your ex_*() subs, I couldn't see the pattern by which they were derived. Undoing that, I now see that they come (directly or not) from Pascal's Triangle. I guess at some point I should review some teaching materials to understand why those constants are used here, but for now it is enough to know that I can now write a generic ex_*() function (generator). Thanks.
by roboticus (Chancellor) on Apr 03, 2012 at 13:32 UTC
In the iterative solution, we're accumulating f(x)=(1-e^(-x/N))^h over the range of x=0 .. NumSamples. That's a rough form of computing the definite integral of the expression. Integrating over three variables (x, N, h) would be a pain, so I treated N and h as constants. So first, we multiply out our f(x) expression to remove the exponent. So using h as 2, we get:
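The expansion itself did not survive the archive; reconstructed from the ex_2() sub in the root post (same notation), it is:

```
f(x) = (1 - e^(-x/N))^2
     = 1  -  2*e^(-x/N)  +  e^(-2x/N)

integ(f(x)) = x  +  2N*e^(-x/N)  -  (N/2)*e^(-2x/N)  +  C
```

Term by term, that is exactly ex_2(): the -N/2 and 2N coefficients come from integrating e^(-2x/N) and -2e^(-x/N) respectively.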
Computing a definite integral over a range is simply integ(f(x)) evaluated at the upper limit, less the value evaluated at the lower limit. This causes the C terms to cancel. Pascal's triangle comes out because we've got (a+b)^n, and when we multiply it out we get the binomial expansion, which is where the coefficients come into play.

One point I should mention: you don't have to use 0 as the lower bound. If you wanted the number of collisions you'd experience from sample A to sample B, just evaluate integ(f(B))-integ(f(A)). By using A=0 we compute the number of collisions for the entire run.

...roboticus
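As a quick sketch of that A-to-B evaluation (my example, reusing ex_4() from the root post with the same N=2**16):

```perl
#!/usr/bin/perl
# Expected collisions between sample A and sample B is
# integ(f(B)) - integ(f(A)); A=0 gives the whole run.
use strict;
use warnings;

# ex_4() as defined in the root post (h=4 hashes, N-bit vectors).
sub ex_4 {
    my ($N, $X) = @_;
    return  -$N/4 *exp(-4*$X/$N)
          + 4*$N/3*exp(-3*$X/$N)
          - 3*$N  *exp(-2*$X/$N)
          + 4*$N  *exp(  -$X/$N) + $X;
}

my $N = 2**16;
my ( $A, $B ) = ( 16384, 65536 );

printf "Expected collisions from sample %d to %d: %g\n",
    $A, $B, ex_4( $N, $B ) - ex_4( $N, $A );
printf "Expected collisions for the whole run   : %g\n",
    ex_4( $N, $B ) - ex_4( $N, 0 );
```

The whole-run figure reproduces the 2882.43 from the first example run in the root post; the A-to-B figure is that total minus the collisions already expected by sample 16384.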
by BrowserUk (Patriarch) on Apr 03, 2012 at 15:01 UTC
Thanks. As may be (becoming) obvious, much of that is over my head for now :) However, now that I know how to derive the constants, I have this, which I can substitute for your ex_4() & ex_10() by supplying the power as the first argument. Its output matches those two exactly for all the argument combinations I've tried:
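BrowserUk's generator itself is not shown above; a sketch of what such a generic ex() might look like (an assumption on my part, not his code, but the coefficients come straight from Pascal's Triangle as described, and the output agrees with ex_4() and ex_10() from the root post):

```perl
#!/usr/bin/perl
# Generic form of the ex_*() subs: integ( (1 - e^(-x/N))^h ) evaluated
# at $X. The k-th term's coefficient is the binomial coefficient C(h,k)
# (Pascal's Triangle), divided by k from integrating e^(-k*x/N), with
# alternating sign from expanding (1 - e^(-x/N))^h.
use strict;
use warnings;

sub binomial {
    my ( $n, $k ) = @_;
    my $c = 1;
    $c = $c * ( $n - $_ + 1 ) / $_ for 1 .. $k;
    return $c;
}

sub ex {
    my ( $h, $N, $X ) = @_;
    my $sum = $X;
    for my $k ( 1 .. $h ) {
        my $sign = $k % 2 ? 1 : -1;   # +, -, +, ... for k = 1, 2, 3, ...
        $sum += $sign * binomial( $h, $k ) * $N / $k * exp( -$k * $X / $N );
    }
    return $sum;
}

printf "ex(4, 65536, 0)   = %.9f\n", ex(  4, 65536, 0 );
printf "ex(10, 65536, 0)  = %.9f\n", ex( 10, 65536, 0 );
```

For h=4 and h=10 this reproduces the integral(0) values 136533.333333333 and 191952.863492063 shown in the root post's example runs.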
That allowed me to investigate the effect of using fewer or more hashes without having to hand-code a new function for each. And the results are quite interesting. This shows that each new (pair?) of hashes added does increase the discrimination substantially, though the gains obviously fall off fairly rapidly. But the really interesting part is the numbers for odd numbers of hashes:
I suspect it is a coding error on my behalf, but I guess it could be a quirk of the numbers? Here's a set using 1 .. 16 2**16-bit hashes:
Sorry for the wrapping.
by tye (Sage) on Apr 03, 2012 at 15:43 UTC
by BrowserUk (Patriarch) on Apr 03, 2012 at 16:07 UTC