in reply to Re: [OT] The statistics of hashing.
in thread [OT] The statistics of hashing.
OK, I think I'm done with this diversion. I've built a program that will compute the expected number of collisions with no looping. Example output:
```
$ ./ana_2.pl 16 4 65536
N=65536, V=4, X=65536 integral(65536)=139415.765849051, integral(0)=136533.333333333
Expected collisions: 2882.43251571807

$ ./ana_2.pl 16 4 16384
N=65536, V=4, X=16384 integral(16384)=136541.854969116, integral(0)=136533.333333333
Expected collisions: 8.52163578287582

$ ./ana_2.pl 14 3 16384
N=16384, V=3, X=16384 integral(16384)=31411.9141476821, integral(0)=30037.3333333333
Expected collisions: 1374.58081434877

$ ./ana_2.pl 16 10 32768
N=65536, V=10, X=32768 integral(32768)=191953.190301726, integral(0)=191952.863492063
Expected collisions: 0.326809662627056
```
The code is straightforward:
$ cat ana_2.pl

```perl
#!/usr/bin/perl
#
# ana_2.pl N V X
#
# N=vector size (bits), V=number of vectors, X=sample number
#
use strict;
use warnings;
use feature ':5.10';

my $n = shift;  $n = 1 << $n;
my $v = shift;
my $x = shift;

my ($exp1, $exp2, $exp3);
given ($v) {
    when ( 1) { $exp1 = ex_1($n, $x),  $exp2 = ex_1($n, 0) }
    when ( 2) { $exp1 = ex_2($n, $x),  $exp2 = ex_2($n, 0) }
    when ( 3) { $exp1 = ex_3($n, $x),  $exp2 = ex_3($n, 0) }
    when ( 4) { $exp1 = ex_4($n, $x),  $exp2 = ex_4($n, 0) }
    when (10) { $exp1 = ex_10($n, $x), $exp2 = ex_10($n, 0) }
    default   { die "Need symbolic integral form for $v vectors!\n"; }
}
$exp3 = $exp1 - $exp2;
print "N=$n, V=$v, X=$x integral($x)=$exp1, integral(0)=$exp2\n";
print "Expected collisions: $exp3\n";

sub ex_1 {
    my ($N, $X) = @_;
    return    $N   *exp(   -$X/$N) + $X;
}

sub ex_2 {
    my ($N, $X) = @_;
    return   -$N/2 *exp( -2*$X/$N)
          +  2*$N  *exp(   -$X/$N) + $X;
}

sub ex_3 {
    my ($N, $X) = @_;
    return    $N/3 *exp( -3*$X/$N)
          -  3*$N/2*exp( -2*$X/$N)
          +  3*$N  *exp(   -$X/$N) + $X;
}

sub ex_4 {
    my ($N, $X) = @_;
    return   -$N/4 *exp( -4*$X/$N)
          +  4*$N/3*exp( -3*$X/$N)
          -  3*$N  *exp( -2*$X/$N)
          +  4*$N  *exp(   -$X/$N) + $X;
}

sub ex_10 {
    my ($N, $X) = @_;
    return    -$N/10*exp(-10*$X/$N)
          +  10*$N/9 *exp( -9*$X/$N)
          -  45*$N/8 *exp( -8*$X/$N)
          + 120*$N/7 *exp( -7*$X/$N)
          -  35*$N   *exp( -6*$X/$N)
          + 252*$N/5 *exp( -5*$X/$N)
          - 105*$N/2 *exp( -4*$X/$N)
          +  40*$N   *exp( -3*$X/$N)
          -  45*$N/2 *exp( -2*$X/$N)
          +  10*$N   *exp( -1*$X/$N) + $X;
}
```
Notes:
...roboticus
When your only tool is a hammer, all problems look like your thumb.
Replies are listed 'Best First'.
Re^3: [OT] The statistics of hashing. (SOLVED)
by BrowserUk (Patriarch) on Apr 02, 2012 at 23:59 UTC
Thank you roboticus. I ran your algorithm against the results of my 60-hour run, and the correlation is simply stunning!
And that still includes the possibility -- though I believe it to be remote -- that there are 1 or more actual dups in there. My only regret now is that I wish I'd allowed the run to go to the 3/4 point instead of stopping it halfway. The number of false positives would still have been easily manageable for the second pass:
That is an order of magnitude less than my own crude attempt was predicting, hence why I stopped the run. The only way I can repay you is to assure you that I will do my very best to try and understand the formula -- which I do not currently. And of course, give you credit when the world comes knocking at my door for my eminently patentable -- at least if you take Apple as your guide -- algorithm :)

With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
by tye (Sage) on Apr 03, 2012 at 04:47 UTC
"I will do my very best to try and understand the formula"

Perhaps this will help some. Below is some fairly simple code that does the precise calculations of the odds for a collision in a single hash (and compares those calculations with the formula roboticus proposed). I came up with a simpler implementation than I expected to. I didn't even try to implement this at first -- only a little because I expected it to be more cumbersome, but mostly because I knew it would be impractical for computing odds for such a large number of insertions: it consumes O($inserts) for memory and O($inserts**2) for CPU.
$i tracks the number of insertions done so far. $odds[$s] represents the odds of there being $s+1 bits set in the hash (after $i insertions). $avg is an average of these values of $s (1..@odds) weighted by the odds. But, more importantly, it is also the odds of getting a single-hash collision when inserting (after $i insertions), except multiplied by $bits ($b). I multiply it and 1-exp(-$i/$b) by $b to normalize to the expected number of set bits instead of the odds of a collision, because humans have a much easier time identifying a number that is "close to 14" than a number that is "close to 14/2**32".

$odds[-1] turns out to exactly match (successively) the values from the birthday problem. For low numbers of $inserts, this swamps the calculation of $avg (the other terms just don't add up to a significant addition), which is part of why I was computing it for some values in my first reply. (Since you asked about that privately.)

I have yet to refresh my memory of the exact power series expansion of exp($x), so what follows is actually half guesses, but I'm pretty confident of them based on vague memory and observed behavior. For large $bits, 1-exp(-$inserts/$bits) ends up being close to 1/$bits because 1-exp(-$inserts/$bits) expands (well, "can be expanded") to a power series where 1/$bits is the first term and the next term depends on 1/$bits**2, which is so much smaller that it doesn't matter much (and nor do any of the subsequent terms, even when added together). On the other hand, for large values of $inserts, 1-exp(-$inserts/$bits) is close to $avg because the formula for $avg matches the first $inserts terms of the power series expansion.

I hope the simple code makes it easy for you to see how these calculations match the odds I described above. But don't hesitate to ask questions if the correspondence doesn't seem clear to you. Running the code shows how my calculations match roboticus' formula.
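The per-insertion update tye describes can be sketched like this (a reconstruction, not tye's actual code, which is not reproduced above; the names $b, @odds and $avg follow his description):

```perl
#!/usr/bin/perl
# Sketch of the exact single-hash calculation described above: track the
# full probability distribution of the number of set bits in a $b-bit
# vector after each random insertion, then compare the exact expected
# count of set bits with the closed-form approximation $b*(1-exp(-$i/$b)).
use strict;
use warnings;

my $b       = 2**10;   # vector size in bits (kept small: the loop is O(n**2))
my $inserts = 2**8;    # number of insertions

my @odds = ( 1 );      # $odds[$s] = P(exactly $s+1 bits set); 1 bit after 1 insert
for my $i ( 2 .. $inserts ) {
    my @next = ( 0 ) x ( @odds + 1 );
    for my $s ( 0 .. $#odds ) {
        my $set = $s + 1;                             # bits currently set
        $next[$s]   += $odds[$s] * $set / $b;         # insert hit an already-set bit
        $next[$s+1] += $odds[$s] * ( 1 - $set / $b ); # insert set a new bit
    }
    @odds = @next;
}

# Exact expected number of set bits...
my $avg = 0;
$avg += ( $_ + 1 ) * $odds[$_] for 0 .. $#odds;

# ...versus the exponential approximation used in roboticus' formula:
my $approx = $b * ( 1 - exp( -$inserts / $b ) );
printf "exact : %.6f\napprox: %.6f\n", $avg, $approx;
```

With these small numbers the two figures already agree to within a fraction of a bit; the gap shrinks further as $b grows.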
Looking up (one of) the power series expansions for computing exp($x) should match the values being computed for @odds, though there might be some manipulation required to make the match apparent (based on previous times I've done such work, decades ago).

- tye
by BrowserUk (Patriarch) on Apr 03, 2012 at 15:59 UTC
"Perhaps this will help some."

I seriously hope this will not offend you, but suspect it will. Simply put, your post does not help me at all. I am a programmer, not a mathematician, but given a formula, in a form I can understand(*), I am perfectly capable of implementing that formula in code. And perfectly capable of coding a few loops and print statements in order to investigate its properties.

What I have a problem with -- as evidently you do too -- is deriving those formulae. Like you (according to your own words above; there is nothing accusatory here), my knowledge of calculus is confined to the coursework I did at college some {mumble mumble} decades ago. Whilst I retain an understanding of the principles of integration, and recall some of its uses, the details are shrouded in a cloud of disuse. "Use it or lose it" is a very current, and very applicable, aphorism.

The direction my career has taken me means that I've had no more than a couple of occasions when calculus would have been useful. And on both those occasions, I succeeded in finding "a man that can", who could provide me with an understandable formula, and thus I achieved my goal without having to relive the history of mathematics.

(*) A big part of the problem is that mathematicians not only have a nomenclature -- which is necessary -- they also have 'historical conventions' -- which are not; and the latter are the absolute bane of the lay-person's life in trying to understand the mathematician's output. There you are, happily following along, when you reach a text that goes something like this:

    We may think intuitively of the Riemann sum: Ʃ_a^b f(x) dx

Where did H come from? Where did a disappear to? Is H (by convention) == to b - a? For the answer to this and other questions, tune in
by tye (Sage) on Apr 03, 2012 at 20:16 UTC
Re^3: [OT] The statistics of hashing.
by BrowserUk (Patriarch) on Apr 03, 2012 at 12:39 UTC
With the 'cancelling out' you've performed on the constants in your ex_*() subs, I couldn't see the pattern by which they were derived. Undoing that, I now see that they come (directly or not) from Pascal's Triangle. I guess at some point I should review some teaching materials to understand why those constants are used here, but for now it is enough to know that I can now write a generic ex_*() function (generator). Thanks.
by roboticus (Chancellor) on Apr 03, 2012 at 13:32 UTC
In the iterative solution, we're accumulating f(x)=(1-e^(-x/N))^h over the range of x=0 .. NumSamples. That's a rough form of computing the definite integral of the expression. Integrating over three variables (x, N, h) would be a pain, so I treated N and h as constants. So first, we multiply out our f(x) expression to remove the exponent. So using h as 2, we get:
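The expansion itself did not survive the archive; reconstructed from the ex_2() sub in the root post (same notation), it is:

```
f(x) = (1 - e^(-x/N))^2
     = 1  -  2*e^(-x/N)  +  e^(-2x/N)

integ(f(x)) = x  +  2N*e^(-x/N)  -  (N/2)*e^(-2x/N)  +  C
```

Term by term, that is exactly ex_2(): the -N/2 and 2N coefficients come from integrating e^(-2x/N) and -2e^(-x/N) respectively.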
Computing a definite integral over a range is simply integ(f(x)) evaluated at the upper limit, less the value evaluated at the lower limit. This causes the C terms to cancel. Pascal's triangle comes out because we've got (a+b)^n, and when we multiply it out we get the binomial expansion, which is where the coefficients come into play.

One point I should mention: you don't have to use 0 as the lower bound. If you wanted the number of collisions you'd experience from sample A to sample B, just evaluate integ(f(B))-integ(f(A)). By using A=0 we compute the number of collisions for the entire run.

...roboticus
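As a quick sketch of that A-to-B evaluation (my example, reusing ex_4() from the root post with the same N=2**16):

```perl
#!/usr/bin/perl
# Expected collisions between sample A and sample B is
# integ(f(B)) - integ(f(A)); A=0 gives the whole run.
use strict;
use warnings;

# ex_4() as defined in the root post (h=4 hashes, N-bit vectors).
sub ex_4 {
    my ($N, $X) = @_;
    return  -$N/4 *exp(-4*$X/$N)
          + 4*$N/3*exp(-3*$X/$N)
          - 3*$N  *exp(-2*$X/$N)
          + 4*$N  *exp(  -$X/$N) + $X;
}

my $N = 2**16;
my ( $A, $B ) = ( 16384, 65536 );

printf "Expected collisions from sample %d to %d: %g\n",
    $A, $B, ex_4( $N, $B ) - ex_4( $N, $A );
printf "Expected collisions for the whole run   : %g\n",
    ex_4( $N, $B ) - ex_4( $N, 0 );
```

The whole-run figure reproduces the 2882.43 from the first example run in the root post; the A-to-B figure is that total minus the collisions already expected by sample 16384.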
by BrowserUk (Patriarch) on Apr 03, 2012 at 15:01 UTC
Thanks. As may be (becoming) obvious, much of that is over my head for now :) However, now that I know how to derive the constants, I have this, which I can substitute for your ex_4() & ex_10() by supplying the power as the first argument. Its output matches those two exactly for all the argument combinations I've tried:
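BrowserUk's generator itself is not shown above; a sketch of what such a generic ex() might look like (an assumption on my part, not his code, but the coefficients come straight from Pascal's Triangle as described, and the output agrees with ex_4() and ex_10() from the root post):

```perl
#!/usr/bin/perl
# Generic form of the ex_*() subs: integ( (1 - e^(-x/N))^h ) evaluated
# at $X. The k-th term's coefficient is the binomial coefficient C(h,k)
# (Pascal's Triangle), divided by k from integrating e^(-k*x/N), with
# alternating sign from expanding (1 - e^(-x/N))^h.
use strict;
use warnings;

sub binomial {
    my ( $n, $k ) = @_;
    my $c = 1;
    $c = $c * ( $n - $_ + 1 ) / $_ for 1 .. $k;
    return $c;
}

sub ex {
    my ( $h, $N, $X ) = @_;
    my $sum = $X;
    for my $k ( 1 .. $h ) {
        my $sign = $k % 2 ? 1 : -1;   # +, -, +, ... for k = 1, 2, 3, ...
        $sum += $sign * binomial( $h, $k ) * $N / $k * exp( -$k * $X / $N );
    }
    return $sum;
}

printf "ex(4, 65536, 0)   = %.9f\n", ex(  4, 65536, 0 );
printf "ex(10, 65536, 0)  = %.9f\n", ex( 10, 65536, 0 );
```

For h=4 and h=10 this reproduces the integral(0) values 136533.333333333 and 191952.863492063 shown in the root post's example runs.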
That allowed me to investigate the effect of using fewer or more hashes without having to hand-code a new function for each. And the results are quite interesting. This shows that each new (pair?) of hashes added does increase the discrimination substantially, though the gains obviously fall off fairly rapidly. But the really interesting part is the numbers for odd numbers of hashes:
I suspect it is a coding error on my behalf, but I guess it could be a quirk of the numbers? Here's a set using 1 .. 16 2**16-bit hashes:
Sorry for the wrapping.
by tye (Sage) on Apr 03, 2012 at 15:43 UTC
by BrowserUk (Patriarch) on Apr 03, 2012 at 16:07 UTC