in reply to Re: An optimal solution to finding weights for a weighted sum signature.
in thread An optimal solution to finding weights for a weighted sum signature.

Actually the perfect list of weights in your example with chars a..z is 1, 26, 26^2, 26^3 .... But this will not fit into your integers as soon as 26^L > 2^64 with L the maximum length of your string.

Yes. That is what I was attempting to describe in the OP when I said: "I could calculate the positional power of the maximum number of characters and generate a guaranteed unique number, but the number of unique items in my collection is ~2e5, the subsets can be hundreds of elements each, so the signature values would rapidly become so huge as to be unmanageable.".

Using this schema on simple character data with an alphabet of just 26 chars means I would run over the limit of my 64-bit ints after only 13 characters.

But as I said above, in a typical example of the type of problem I am trying to cater for, the 'alphabet' could be as large as 200,000 or even larger; and the size of my subsets (aka. the length of the strings) is frequently multiple hundreds or even low thousands. As such, there can be no "perfect set of weights".

However, those numbers do hide some good news. With numbers this large, the number of combinations is ludicrously large, so even if I need to signature hundreds of billions of them, I will still only be processing a tiny, minuscule percentage of the total, so the likelihood that I would encounter a collision without it being an exact duplicate -- assuming even distributions -- is very slim indeed.

To try to put some flesh on that assertion: let's say my 'alphabet' is 200,000 (N) unique things, and my subsets each consist of some combination of 1,000 (K) of those things. The total number of possible combinations is then C(N,K) = N! / (K! * (N-K)!), which for those example numbers is a number with well over 2,700 digits.

So even if I need to process one trillion of those, that's still only 0.000...{well over 2,700 zeros}...1% of the total. False collisions are unlikely.
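
For anyone who wants to sanity-check that magnitude, here is a quick sketch that sums logarithms rather than multiplying huge integers (illustrative only; the variable names are mine and it isn't part of any tool mentioned here):

    use strict; use warnings;

    my( $N, $K ) = ( 200_000, 1_000 );

    # log10 of N-choose-K, accumulated term by term so nothing overflows
    my $log10 = 0;
    $log10 += log( ( $N - $K + $_ ) / $_ ) / log( 10 ) for 1 .. $K;

    printf "C(%d,%d) has about %d digits\n", $N, $K, int( $log10 ) + 1;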

Of course, totally even distributions in real-world data are few and far between, but it still means that using a less-than-perfect signature mechanism is very likely to detect actual duplicates and will rarely report false ones.

It's not about primes,

Perhaps not mathematically, but practically it is; in part because I'm making it so. It is easy to demonstrate (with small subsets) that if I use powers-of-2 weights, the incidence of collisions is vastly reduced:

C:\test>weightedPrimeSumProduct -K=6 -M=16 1 2 4 8 16 32 64 128 256 512 1024 2048 4096 8192 16384 32768 65536 131072 262144 524288 1048576 2097152 4194304 8388608 16777216 33554432
736000 Ave. collisions: 7.091558; Max. collisions:79.000000

But that limits me to 'alphabets' of 63 chars at most; and in practice to far fewer once the sum of products comes into play.

I might consider trying to stretch the alphabet by assigning each power of two to two or more items in my collection, but that causes a very rapid climb in the false positives:

C:\test>weightedPrimeSumProduct -K=6 -M=16 1 1 2 2 4 4 8 8 16 16 32 32 64 64 128 128 256 256 512 512 1024 1024 2048 2048 4096 4096
736000 Ave. collisions: 81.936457; Max. collisions:671.000000

C:\test>weightedPrimeSumProduct -K=6 -M=16 1 1 1 2 2 2 4 4 4 8 8 8 16 16 16 32 32 32 64 64 64 128 128 128 256 256
736000 Ave. collisions: 414.806197; Max. collisions:2931.000000

Almost back to monotonic sequential weightings. And this is where the idea came in of using primes to expand the alphabet size that could be handled whilst reducing the chances of arithmetic collisions. The sum of multiples of unique primes is less likely to collide; and by using every second, third or fourth prime, the collisions reduce further. There is no real surprise in any of that.

What did surprise me was the effect that shuffling the primes had on reducing the collision rates. And that brings me back to my questions in the OP regarding how to determine the best ordering of the chosen set of primes.

I am quite convinced that a GA is the only game in town for arriving at a good, if not necessarily optimal ordering; but I'm still trying to wrap my head around a suitable mutation algorithm to allow the GA to arrive at a good solution within a reasonable time frame.
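
For concreteness, the kind of mutation operators I have in mind are the standard ones for permutation problems. A minimal sketch (illustrative only; this is not the code I'm actually running):

    use strict; use warnings;

    # Swap mutation: exchange two randomly chosen positions in the ordering.
    sub mutate_swap {
        my @perm = @{ $_[0] };
        my( $i, $j ) = ( int rand @perm, int rand @perm );
        @perm[ $i, $j ] = @perm[ $j, $i ];
        return \@perm;
    }

    # Reversal mutation: reverse a random slice, which keeps most of the
    # ordering intact while still making a meaningful change.
    sub mutate_reverse {
        my @perm = @{ $_[0] };
        my $i = int rand @perm;
        my $j = $i + int rand( @perm - $i );
        @perm[ $i .. $j ] = reverse @perm[ $i .. $j ];
        return \@perm;
    }

    my $order = [ 2, 3, 5, 7, 11, 13, 17, 19 ];
    print "@{ mutate_swap( $order ) }\n";
    print "@{ mutate_reverse( $order ) }\n";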


With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority". The enemy of (IT) success is complexity.
In the absence of evidence, opinion is indistinguishable from prejudice.

Re^3: An optimal solution to finding weights for a weighted sum signature.
by Eily (Monsignor) on Nov 05, 2016 at 22:58 UTC

    With ~2e5, you'll have to do something besides the simple weighted sum (like a modulus) if you want to avoid the bell shape, which tends to be a thing with the sum of independent variables (with identical probability distributions - not your case with the weights - you'd actually get close to a Gaussian distribution, a perfect bell)

    The sum of multiples of unique primes is less likely to collide.
    That's not a property of primes that I know of. The reason primes give you better results is that their progression is close enough to an exponential one.

    What did surprise me was the effect that shuffling the primes had on reducing the collision rates.
    I actually believed this couldn't be true until I realized you are not working with fixed length input. If you were, you'd just have to apply the same permutation to the string as the one applied to the primes to obtain the same result:
    a b c d
    3 7 17 29
    Is the same as:
    b d a c
    7 29 3 17
    and so on. But with variable-length input, the coefficients on the left are more likely to be used. So you obtain better results by having coefficients that are more distant at the front and the "less efficient" ones at the end. You can start with 1, and then add your coefficients by maximizing the ratio (the distance between the bits two characters will affect is log(K1)-log(K2), so an increasing function of K1/K2).

    Taking the powers of two up to 2^16, that could be: 2^0, 2^16, 2^8, 2^12, 2^4, 2^10, 2^6, 2^14, 2^2, 2^15, 2^1, 2^13, 2^3, 2^11, 2^5, 2^9, 2^7. The biggest ratio you can get with those numbers is of course 2^16. Then you take the number that's halfway (in terms of ratio) between those two: 2^8. Next, 2^12 and 2^4 both have a minimum ratio of 2^4 with one of the previously chosen numbers (I put 2^12 first because it's the furthest away from 2^0, which is the coefficient with the highest probability of being used). Then 2^10, 2^6, 2^14 and 2^2. All the remaining orders introduce two new 2^0 ratios.
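
    A sketch of one way to generate that kind of interleaving greedily. Ties can be broken in more than one way, so it won't reproduce the hand-derived order above exactly:

        use strict; use warnings;
        use List::Util qw(min);

        my @chosen    = ( 0 );            # always start with 2**0
        my @remaining = ( 1 .. 16 );

        while( @remaining ) {
            # distance in bit positions == log2 of the ratio between two weights
            my %score = map {
                my $e = $_;
                $e => min( map { abs( $e - $_ ) } @chosen );
            } @remaining;

            # pick the exponent furthest (in log terms) from everything chosen so far;
            # break ties in favour of the larger exponent (furthest from 2**0)
            my( $best ) = sort { $score{$b} <=> $score{$a} || $b <=> $a } @remaining;

            push @chosen, $best;
            @remaining = grep { $_ != $best } @remaining;
        }

        print join( ' ', map { "2**$_" } @chosen ), "\n";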

      The reason primes give you better results is that their progression is close enough to an exponential one.

      Hm. With powers of 10, you allow for an alphabet of 20. With powers of 2, 64.

      Even if you move to powers of non-integer values, you can still only cater for relatively small alphabets: e.g. 1.1**N exceeds 2^64 once N reaches 466:

      $l = 2**64; $x = 1.1; $n = 1; 1 while ($x**$n++) < $l; print $n;;
      467

      My reason for choosing primes is that there are approximately 416 quadrillion primes less than 2^64; enough to allow for even the largest alphabet and still be able to spread them out by using every 10th, 20th or even 100th prime.

      I actually believed this couldn't be true until I realized you are not working with fixed length input.

      I was confused as to what you meant by "not fixed length input"; but eventually the penny dropped.

      Yes, having more weights than there are members in each subset, and shuffling them all before selecting the first K to match the subset size, is why shuffling works.
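
      To make that concrete, here is a stripped-down sketch of the kind of calculation I mean. It is illustrative only: the tiny prime pool, the element values and the absence of 64-bit wrap-around are all simplifications, not the real weightedPrimeSumProduct code:

          use strict; use warnings;
          use List::Util qw(shuffle);

          # A pool of M weights (just the first few primes here for brevity),
          # shuffled once; the first K of them get used for each subset.
          my @pool = shuffle( 2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37 );

          sub signature {
              my @subset = @_;                     # element values, e.g. 1..26 for a..z
              my @w      = @pool[ 0 .. $#subset ]; # leftmost weights get used most often
              my $sig    = 0;
              $sig += $w[ $_ ] * $subset[ $_ ] for 0 .. $#subset;
              return $sig;                         # in practice this would be 64-bit integer arithmetic
          }

          # two different subsets that a poor weighting could easily confuse
          print signature( 1, 2, 3 ), ' vs ', signature( 3, 2, 1 ), "\n";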

      You can start with 1, and then add your coefficients by maximizing the ratio (the distance between the bits two characters will affect is log(K1)-log(K2), so an increasing function of K1/K2).

      That's an interesting theory. (Or fact? It's not clear from your post; and my math isn't strong enough to decide one way or the other.)

      If that process of 'interleaving' the weights approaches optimal, then when I run my GA on a large selection of primes, I would expect it to quite quickly start converging on numbers that come close to that kind of relationship.

      I did start a run of the GA this morning; but I used too large an alphabet (1000) and my mutation function was not good, so whilst it was slowly improving with each iteration, it was going to run for a very long time before getting anything noteworthy.

      I've had some more thoughts about the mutation strategy; and tomorrow I'll code that up and start a run using a much smaller alphabet (100 or 200) and see how things are looking after a couple of hours. It'll be really interesting if the results bear out your thesis. (Assuming my new mutation sub works.)

      Once I get some results worth sharing, I'll come back.

      BTW: Thank you for your time and input. This problem is one I've been playing with on and off going back years. At this point it's not actually for anything in particular; but rather my attempt to find a generic, or at least more general, solution to a problem that I've had to solve many times down the years; and each time the solution has had to be developed to be very specific to the exact dataset it has to handle.

      That's a pain in the rump because: 1) it can take a lot of effort and testing to develop an algorithm that works for a particular dataset, and then it cannot be reused; 2) if the dataset for a particular project changes over time, it can necessitate reiterating the development process to re-tailor it to the new state of play.

      My hope is that if I can find a reasonable (automated and doesn't take too long) way of choosing a set and ordering of primes for a particular size of alphabet it could save a great deal of effort in the future.

      My current notion goes something like this: feed in the size of the alphabet; divide that into the number of primes less than 2^64 and use half of the result as a step size to select a bunch of widely spaced primes. Then run a GA on those primes over (say) 100 million random combinations and use the best ordering found by the GA for the actual processing.
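
      In rough code, the prime-selection part of that notion might look something like the following. It's only a sketch: it assumes Math::Prime::Util is available, and the sizes and step are deliberately tiny so that it runs in moments, rather than using the 'half the primes below 2^64 divided by the alphabet size' step described above:

          use strict; use warnings;
          use Math::Prime::Util qw(nth_prime);

          my $alphabet = 200;                 # size of the alphabet
          my $poolsize = 2 * $alphabet;       # more weights than any subset needs, so they can be shuffled
          my $step     = 1_000;               # spacing between selected primes (tiny, for illustration)

          # pick every $step-th prime so the weights are widely spaced
          my @weights = map { nth_prime( $_ * $step ) } 1 .. $poolsize;

          print scalar( @weights ), " weights, from $weights[0] to $weights[-1]\n";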

      It's somewhere to start :)


      With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
      Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
      "Science is about questioning the status quo. Questioning authority". The enemy of (IT) success is complexity.
      In the absence of evidence, opinion is indistinguishable from prejudice.
Re^3: An optimal solution to finding weights for a weighted sum signature.
by pryrt (Abbot) on Nov 06, 2016 at 00:12 UTC

    I hadn't been able to figure out why the shuffling helped you, either. Before Eily's post, I was going to lean down the road of suggesting comparing your monotonically increasing ramp of weights with a monotonically decreasing ramp of weights, and seeing whether collisions were affected. And then recommend trying a list of weights which was half ramp + half random (actually, three sets of half-ramp: #1: first half increasing, #2: middle half increasing (a quarter random on each side), #3: final half increasing), and comparing all of those to one of your fully-shuffled weight sets. If you saw that the down-ramp and the half-ramps all had worse collisions than your shuffle, then I would suggest trying a mutator that tried to get rid of rampy segments: if you found a sequence of weights that were strictly increasing (or generally increasing, with occasional excursions), I would suggest having the mutator pick new random primes or just re-shuffle the rampy section.
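
    In rough code, that ramp-detecting mutator might look something like this (an untested sketch; the minimum run length is an arbitrary choice):

        use strict; use warnings;
        use List::Util qw(shuffle);

        # Find strictly increasing runs of at least $minrun weights and re-shuffle them.
        sub deramp {
            my( $weights, $minrun ) = @_;
            my @w = @$weights;
            my $start = 0;
            for my $i ( 1 .. $#w + 1 ) {
                if( $i > $#w or $w[ $i ] <= $w[ $i - 1 ] ) {   # the current run ends here
                    my $len = $i - $start;
                    @w[ $start .. $i - 1 ] = shuffle @w[ $start .. $i - 1 ]
                        if $len >= $minrun;
                    $start = $i;
                }
            }
            return \@w;
        }

        my $mutated = deramp( [ 3, 5, 7, 11, 13, 2, 17, 19 ], 4 );
        print "@$mutated\n";    # the 3 5 7 11 13 ramp gets re-shuffled; the rest stays put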

    Also, I have found that histograms can hide some critical information. With your ordered set of 736000 combinations from the first example, you might want to compare the time-series of the generated signatures -- probably with the multiple weight lists: ramp up, ramp down, 3 half ramps, and one or two completely random weights, to see if any patterns jump out and give you a hint for what your mutator could do.

    Finally, as a last thought: as Eily said, sums of primes don't approach uniqueness. However, products of primes do. I know that purely multiplying primes is unique. I spent some time while I couldn't sleep last night trying to come up with a reasonable way to use that, since you'd obviously quickly overrun your 64bit integer with a product of K=1000 primes. :-) I think I came up with something (after looking up a few primey facts today) that would be a product of two primes.

    • To fit in 64bits, each prime would have to be less than 2**32. The biggest 32bit prime number (1) is 4_294_967_291, which is the 203_280_221st (2) prime.
    • That number of multiplyable-primes is bigger than your expected N≥200_000, so in theory, we could assign two primes (or 1000) to each of your N things, and multiply two primes together.
    • If you didn't want to just start doling out primes randomly, you could do something like
      • @prime[1 .. 200_000] = import_from_list_of_first_million_primes()
      • generate two 1000-weight lists @wa and @wb, using your randomly shuffled primes
      • for each $comb, generate two sigs $sa and $sb in your sig function
        • even if one of your sigs had a collision, the chances that both collide at the same time are significantly smaller
        • return $product = $prime[$sa] * $prime[$sb]
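
    In very rough Perl, that might look like the following. It's an untested sketch, and reducing each sig modulo the size of the prime table to get a legal index is my hand-waving rather than a worked-out detail:

        use strict; use warnings;

        # @prime would really hold ~200_000 primes below 2**32; a handful here just for shape.
        my @prime = ( 2, 3, 5, 7, 11, 13, 17, 19, 23, 29 );

        # two independently shuffled weight lists (stand-ins only)
        my @wa = ( 3, 7, 17, 29 );
        my @wb = ( 31, 5, 23, 11 );

        sub double_sig {
            my @values = @_;
            my( $sa, $sb ) = ( 0, 0 );
            for my $i ( 0 .. $#values ) {
                $sa += $wa[ $i ] * $values[ $i ];
                $sb += $wb[ $i ] * $values[ $i ];
            }
            # fold each sig down to a legal index, then combine as a product of primes:
            # a collision now requires *both* sigs to collide at once.
            return $prime[ $sa % @prime ] * $prime[ $sb % @prime ];
        }

        print double_sig( 1, 2, 3, 4 ), "\n";
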
      I hadn't been able to figure out why the shuffling helped you, either. Before Eily's post,

      As I said, I was surprised at the result. The fact that I am not shuffling the same K primes as I had intended, but rather shuffling a set of n*K primes and then selecting the leftmost K of those, was a programming error, and one I didn't twig to until some time this morning. However, the benefits of it are a very nice side-effect of my error, and I'll take that with knobs on :)

      Finally, as a last thought: as Eily said, sums of primes don't approach uniqueness. However, products of primes do.

      Agreed, but please note that I'm not "summing primes", but rather summing products of primes, and different multiples of different primes to boot. So, whilst the results aren't unique (otherwise my collision test would show none), with the right set of primes in the right order, and largish subsets, the results show that they do actually approach uniqueness for practical purposes.

      This is using 26 element subsets (using a 26-char alphabet) and picking every 384th prime from the first 10,000 and examining the collisions in the first 1 & 10 million:

      C:\test>weightedPrimeSumProduct -K=26 -M=384 -S=2417 -C=1e6
      9311 32213 20393 52667 44269 36161 60953 56809 78193 40213 12907 48527 104579 91331 73939 69709 95603 16603 86969 65357 82493 100109 2657 28181 5849 24137
      1000000 Ave. collisions: 1.163441; Max. collisions:6.000000

      C:\test>weightedPrimeSumProduct -K=26 -M=384 -S=2417 -C=10e6
      9311 32213 20393 52667 44269 36161 60953 56809 78193 40213 12907 48527 104579 91331 73939 69709 95603 16603 86969 65357 82493 100109 2657 28181 5849 24137
      10000000 Ave. collisions: 2.654229; Max. collisions:17.000000

      Note: the shuffle ordering is not well chosen; the seed is just the first random number I typed, used to initialise the PRNG.

      I realise that 1 & 10 million is a tiny proportion of the 413 septillion combinations, but that is actually fairly representative of the percentages I would expect for the much larger subsets and very much larger alphabets I'm targeting.

      I think I came up with something (after looking up a few primey facts today) that would be a product of two primes.

      On first (and second, and third) reading, that appears to me to be genius. I love this place for ideas like that. Thank you!

      This day has been way too long for my old brain to really apply much in the way of analysis tonight; but I will tomorrow ....


      With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
      Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
      "Science is about questioning the status quo. Questioning authority". The enemy of (IT) success is complexity.
      In the absence of evidence, opinion is indistinguishable from prejudice.