BrowserUk has asked for the wisdom of the Perl Monks concerning the following question:

A simple mechanism for producing a signature from a subset of a large collection (e.g. strings of characters) is a weighted sum calculation. I.e. assign each character a numeric value; multiply each character's value by a weight, say its position in the string; and sum the resultant products:

sub sig{ my $sig = 0; my $n = 0; $sig += ( $_ - 96 ) * ++$n for unpack 'C*', $_[0]; return $sig; };;

print $string = join '', map chr( 97+rand( 26) ), 1 .. 10;;
oqqcfuwfco
print $sig = sig( $string );;
654

print $string = join '', map chr( 97+rand( 26) ), 1 .. 10;;
qwfiedngda
print $sig = sig( $string );;
366

That's okay as far as it goes, but it produces a bell curve with lots of duplicate signatures:

use Algorithm::Combinatorics qw[ combinations_with_repetition ];;

$i = combinations_with_repetition( [ 'a'..'z' ], 6 );;

undef %stats;
++$stats{ sig( join'', @$comb ) } while defined( $comb = $i->next );

printf "Ave. collisions:%.2f Max. collisions:%.2f\n",
    736281 / keys( %stats ), max( values %stats );;
Ave. collisions:1438.05 Max. collisions:4061.00

pp \%stats;;

Abbreviated output:

For my purposes, that collision ratio is far too high, so I set about looking for a way to reduce it. Obviously, I could calculate the positional power of the maximum number of characters and generate a guaranteed unique number, but the number of unique items in my collection is ~2e5, the subsets can be hundreds of elements each, so the signature values would rapidly become so huge as to be unmanageable. Indeed, when stringified, the signatures would often be longer than the subsets they are meant to represent.

My first thought was to replace the monotonic sequential weights with the first N prime numbers:
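
A minimal sketch of the idea (the actual code isn't shown here; Math::Prime::Util is assumed as the source of primes):

use strict; use warnings;
use Math::Prime::Util qw[ nth_prime ];

# Weight position $n by the $n-th prime instead of by $n itself.
my @weights = map nth_prime( $_ ), 1 .. 26;

sub prime_sig {
    my $sig = 0;
    my $n   = 0;
    $sig += ( $_ - 96 ) * $weights[ $n++ ] for unpack 'C*', $_[0];
    return $sig;
}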

and that certainly reduces the collisions markedly:

C:\test>junk52 -K=6 -M=1 -S=0
2 3 5 7 11 13 17 19 23 29 31 37 41 43 47 53 59 61 67 71 73 79 83 89 97 101
736000 Ave. collisions: 586.678088; Max. collisions:2567.000000

Then I thought that instead of using sequential primes, it might help if I chose every other prime, or primes even more widely spaced, and that worked very well. Each doubling of the spacing between prime weights pretty much halved both the average and maximum number of collisions:

C:\test>junk52 -K=6 -M=1 -S=0
2 3 5 7 11 13 17 19 23 29 31 37 41 43 47 53 59 61 67 71 73 79 83 89 97 101
736000 Ave. collisions: 586.678088; Max. collisions:2567.000000

C:\test>junk52 -K=6 -M=2 -S=0
3 7 13 19 29 37 43 53 61 71 79 89 101 107 113 131 139 151 163 173 181 193 199 223 229 239
736000 Ave. collisions: 439.046512; Max. collisions:1400.000000

C:\test>junk52 -K=6 -M=4 -S=0
7 19 37 53 71 89 107 131 151 173 193 223 239 263 281 311 337 359 383 409 433 457 479 503 541 569
736000 Ave. collisions: 198.298142; Max. collisions:699.000000

C:\test>junk52 -K=6 -M=8 -S=0
19 53 89 131 173 223 263 311 359 409 457 503 569 613 659 719 769 827 881 941 997 1049 1097 1163 1223 1283
736000 Ave. collisions: 93.377425; Max. collisions:468.000000

C:\test>junk52 -K=6 -M=16 -S=0
53 131 223 311 409 503 613 719 827 941 1049 1163 1283 1423 1511 1619 1747 1877 2003 2129 2267 2377 2503 2657 2741 2861
736000 Ave. collisions: 44.996700; Max. collisions:227.000000
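
junk52 itself isn't shown above; a plausible sketch of how such -M spaced weights might be generated (again assuming Math::Prime::Util) is simply to take every M-th prime:

use strict; use warnings;
use Math::Prime::Util qw[ nth_prime ];

# Every $spacing-th prime: spacing 1 gives 2 3 5 7 ..., spacing 4 gives 7 19 37 53 ...
sub spaced_primes {
    my( $count, $spacing ) = @_;
    return map nth_prime( $_ * $spacing ), 1 .. $count;
}

print join ' ', spaced_primes( 26, 4 );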

That's great, but there is a limit to how far I can take it: as the weights get bigger, the sum of products starts to exceed my 64-bit integers for ever smaller subsets. So I then started to think about what else I could do to reduce the collisions, and wondered whether shuffling the spaced primes would have any benefit; that alone again halved the collision rates:

C:\test>junk52 -K=6 -M=16 -S=1
2741 2503 4481 7457 3671 7121 3413 6563 827 3301 7561 4909 3011 2657 311 5167 223 6379 1877 2377 1619 7691 6271 1283 6007 3541
736000 Ave. collisions: 19.635207; Max. collisions:119.000000

Of course, that is only one sample of the possible shuffles, so I ran a bunch of reproducible shuffles to see whether all shuffles would be equal; and if not, whether I could see any pattern in the ordering of those that produced the best results:
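
The shuffling harness isn't shown here; a minimal sketch of what a reproducible shuffle run might look like (seed the RNG per run, then Fisher-Yates) is:

use strict; use warnings;

# The -M=16 weight set from above.
my @weights = ( 53, 131, 223, 311, 409, 503, 613, 719, 827, 941, 1049, 1163, 1283,
                1423, 1511, 1619, 1747, 1877, 2003, 2129, 2267, 2377, 2503, 2657, 2741, 2861 );

for my $seed ( 1 .. 20 ) {
    srand( $seed );                             # reproducible per-run seed
    my @shuffled = @weights;
    for my $i ( reverse 1 .. $#shuffled ) {     # Fisher-Yates shuffle
        my $j = int rand( $i + 1 );
        @shuffled[ $i, $j ] = @shuffled[ $j, $i ];
    }
    # ... score @shuffled as in the runs above and record ( $seed, $ave, $max ) ...
    print "seed $seed: @shuffled\n";
}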

I've done a bunch of runs like the one above and the results are remarkably consistent, with the average number of collisions rarely going below 18/sig or above 22/sig. On the face of it, any shuffle seems good enough to roughly halve the collision rate relative to monotonically increasing spaced primes.

But therein lies my quandary: monotonically increasing values is just another of the possible shuffles, so why would every other shuffle produce so much better results? Then I looked more closely at the results I was getting (above and in other runs): in most runs there are one or two that exceed a 23/sig average, and usually at most one that gets below 18. In the case of the run above, this result set stands out:

4751 613 409 53 1283 4057 719 7561 6563 7691 1511 5309 6007 7253 6133 941 6379 3923 223 1747 131 6971 1163 7853 6701 5167
736000 Ave. collisions: 16.439247; Max. collisions:108.000000

Its average collisions are more than 1.5/sig lower than the next best, yet there does not seem to be anything special about the ordering of the primes. (Note: the primes used are the same in all cases; only their ordering varies.)

So now I'm thinking that there may be an "optimal ordering" for any given set of prime multipliers, and I'm considering ways that I might discover that ordering.

The obvious programming approach to this goal would be to implement a GA for the problem. The scoring function is trivial, if not especially quick, and the selection criteria could be the average or maximum collision rates, or some weighted combination of the two. I favour the former.
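
For the toy a..z case, the scoring function is just the collision counting from above, parameterised by the candidate weight ordering; a sketch:

use strict; use warnings;
use List::Util qw[ max ];
use Algorithm::Combinatorics qw[ combinations_with_repetition ];

# Score a candidate weight ordering: signature every 6-combination of a..z
# and return ( average collisions per signature, maximum collisions ).
sub score {
    my @weights = @_;
    my %stats;
    my $iter = combinations_with_repetition( [ 'a' .. 'z' ], 6 );
    while( defined( my $comb = $iter->next ) ) {
        my $sig = 0;
        $sig += ( ord( $comb->[ $_ ] ) - 96 ) * $weights[ $_ ] for 0 .. $#{ $comb };
        ++$stats{ $sig };
    }
    return ( 736281 / keys( %stats ), max( values %stats ) );
}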

But where I'm having trouble is in deciding on a good mutation strategy. Looking at the result sets above (and the others I've generated), there seems to be a great preponderance of average results and very few 'good' ones; so I'm thinking that any kind of mass reordering of the good sets, as a mutation strategy, is likely simply to produce more average sets.

So I'm thinking that, rather than mutating the top ten or twenty results in some large way, I should produce the next generation by swapping several single pairs of values in each of the (say) three best results from the previous generation.
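
As a concrete sketch of that pair-swap mutation (the surrounding GA loop is omitted):

use strict; use warnings;

# Produce a child ordering from a parent by swapping a few random pairs.
sub mutate_by_swaps {
    my( $parent, $nswaps ) = @_;        # arrayref of weights, number of swaps
    my @child = @{ $parent };
    for ( 1 .. $nswaps ) {
        my( $i, $j ) = ( int rand @child, int rand @child );
        @child[ $i, $j ] = @child[ $j, $i ];
    }
    return \@child;
}

# e.g. breed ten children from each of the three best parents:
# push @next_gen, mutate_by_swaps( $best[ $_ % 3 ], 2 ) for 0 .. 29;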

Does anyone have a better suggestion for a mutation strategy?

And finally (sorry this got so long), do any of the math-heads out there have any suggestions for a better -- perhaps mathematical -- method of choosing and/or ordering the low prime weights? Or indeed, any other set of numerical weights that won't exceed my 64-bit integers when sum-producted?

Thanks for your time and considerations; any thoughts and suggestions are welcomed.


With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority". The enemy of (IT) success is complexity.
In the absence of evidence, opinion is indistinguishable from prejudice.

Replies are listed 'Best First'.
Re: An optimal solution to finding weights for a weighted sum signature.
by Eily (Monsignor) on Nov 04, 2016 at 17:52 UTC

    It's not about primes, it's about big numbers: with weights 1, 1000, 10000000, the first char will contribute to the LSBs, the second char to more significant bits, and so on. The higher the ratio between those numbers, the greater the distance between the bits your chars can change. You see collisions when there is an overlap. For example, with 26 values and weights of 1, 8, and 64, each char will directly affect a range of 8 bits, and each range will be at a distance of 3 bits from the next, with an overlap of 5. But with weights 1, 32 and 1024, you'll get no overlap in the range of bits each char can affect (so no carry).

    Actually the perfect list of weights in your example with chars a..z is 1, 26, 26^2, 26^3 .... But this will not fit into your integers as soon as 26^L > 2^64 with L the maximum length of your string.

    The general best distribution (no collisions, ignoring the limitation of 64 bits) is the list of K^N with K the number of possible characters. And I'm pretty sure the list of K^(N/C) should give you an average collision count close to C (C strings for each value), but would require an integer with C times fewer bits.
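
    A minimal sketch of those perfect weights for the toy a..z case (this is just bijective base-26, so it overflows 64-bit integers beyond about 13 characters):

    sub base26_sig {
        my $sig = 0;
        my $w   = 1;
        for my $c ( unpack 'C*', $_[0] ) {
            $sig += ( $c - 96 ) * $w;    # 'a' => 1 .. 'z' => 26
            $w   *= 26;                  # weights 26**0, 26**1, 26**2, ...
        }
        return $sig;
    }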

      Actually the perfect list of weights in your example with chars a..z is 1, 26, 26^2, 26^3 .... But this will not fit into your integers as soon as 26^L > 2^64 with L the maximum length of your string.

      Yes. That is what I was attempting to describe in the OP when I said: "I could calculate the positional power of the maximum number of characters and generate a guaranteed unique number, but the number of unique items in my collection is ~2e5, the subsets can be hundreds of elements each, so the signature values would rapidly become so huge as to be unmanageable.".

      Using this scheme on simple character data with an alphabet of just 26 chars means I would run over the limit of my 64-bit ints after only 13 characters.

      But as I said above, in a typical example of the type of problem I am trying to cater for, the 'alphabet' could be as large as 200,000 or more; and the size of my subsets (aka the length of the strings) is frequently multiple hundreds or even low thousands. As such, there can be no "perfect set of weights".

      However, those numbers do hide some good news. With numbers this large, the number of combinations is ludicrously large, so even if I need to signature hundreds of billions of them, I will still only be processing a tiny, minuscule percentage of the total; so the likelihood that I would encounter a collision that isn't an exact duplicate -- assuming even distributions -- is very slim indeed.

      To try and put some flesh on that assertion: let's say my 'alphabet' is 200,000 (N) unique things, and my subsets each consist of some combination of 1000 (K) of those things; then the formula for the total number of combinations (with repetition) is (N+K-1)! / (K! * (N-1)!), which for those example numbers comes out at roughly 3.5e2734.
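
      A quick back-of-the-envelope check of that count, summing logarithms so nothing overflows:

      use strict; use warnings;

      my( $N, $K ) = ( 200_000, 1_000 );
      my $log10 = 0;
      # log10 of C( N+K-1, K ), accumulated term by term
      $log10 += log( ( $N + $K - 1 - $_ ) / ( $K - $_ ) ) / log( 10 ) for 0 .. $K - 1;
      printf "Total combinations ~ 1e%d\n", $log10;    # ~ 1e2734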

      So even if I need to process one trillion of those, that's still only around 3e-2720 percent of the total. False collisions are unlikely.

      Of course, totally even distributions in real-world data are few and far between; but it still means that using a less-than-perfect signature mechanism is very likely to detect actual duplicates and only rarely to flag false ones.

      It's not about primes,

      Perhaps not mathematically, but practically it is, in part because I'm making it so. It is easy to demonstrate (with small subsets) that if I use powers-of-two weights, the incidence of collisions is vastly reduced:

      C:\test>weightedPrimeSumProduct -K=6 -M=16
      1 2 4 8 16 32 64 128 256 512 1024 2048 4096 8192 16384 32768 65536 131072 262144 524288 1048576 2097152 4194304 8388608 16777216 33554432
      736000 Ave. collisions: 7.091558; Max. collisions:79.000000

      But that limits me to 'alphabets' of 63 chars at most; and in practice to far fewer once the sum of products comes into play.

      I might consider trying to stretch the alphabet by assigning each power of two to two or more items in my collection, but that causes a very rapid climb in the false positives:

      C:\test>weightedPrimeSumProduct -K=6 -M=16
      1 1 2 2 4 4 8 8 16 16 32 32 64 64 128 128 256 256 512 512 1024 1024 2048 2048 4096 4096
      736000 Ave. collisions: 81.936457; Max. collisions:671.000000

      C:\test>weightedPrimeSumProduct -K=6 -M=16
      1 1 1 2 2 2 4 4 4 8 8 8 16 16 16 32 32 32 64 64 64 128 128 128 256 256
      736000 Ave. collisions: 414.806197; Max. collisions:2931.000000

      Almost back to monotonic sequential weightings. And this is where the idea came from of using primes to expand the alphabet size that could be handled whilst reducing the chances of arithmetic collisions. The sum of multiples of unique primes is less likely to collide. And by using every second, third or fourth prime, the collisions reduce further. There is no real surprise in any of that.

      What did surprise me was the effect that shuffling the primes had on reducing the collision rates. And that brings me back to my questions in the OP regarding how to determine the best ordering of the chosen set of primes.

      I am quite convinced that a GA is the only game in town for arriving at a good, if not necessarily optimal ordering; but I'm still trying to wrap my head around a suitable mutation algorithm to allow the GA to arrive at a good solution within a reasonable time frame.



        With ~2e5, you'll have to do something besides the simple weighted sum (like a modulus) if you want to avoid the bell shape, which tends to be a thing with the sum of independent variables (with identical probability distributions - not your case with the weights - you'd actually get a distribution close to Gaussian, a perfect bell).

        The sum of multiples of unique primes is less likely to collide.
        That's not a property of primes that I know of. The reason primes give you better results is that their progression is close enough to an exponential one.

        What did surprise me was the effect that shuffling the primes had on reducing the collision rates.
        I actually believed this couldn't be true until I realized you are not working with fixed length input. If you were, you'd just have to apply the same permutation to the string as the one applied to the primes to obtain the same result:
        a  b  c  d
        3  7  17 29
        Is the same as:
        b  d  a  c
        7  29 3  17
        and so on. But with variable-length input, the coefficients at the front are more likely to be used. So you obtain better results by having the coefficients that are more distant (in ratio) at the front and the "less efficient" ones at the end. You can start with 1, and then add each coefficient by maximizing the ratio (the distance between the bits two characters will affect is log(K1)-log(K2), so an increasing function of K1/K2).

        Taking the powers of two up to 2^16, that could be: 2^0, 2^16, 2^8, 2^12, 2^4, 2^10, 2^6, 2^14, 2^2, 2^15, 2^1, 2^13, 2^3, 2^11, 2^5, 2^9, 2^7. The biggest ratio you can get with those numbers is of course 2^16. Then you take the number that's halfway (in terms of ratio) between those two: 2^8. 2^12 and 2^4 both have a minimum ratio of 2^4 with one of the previous numbers (I put 2^12 first because it's the furthest away from 2^0, which is the coefficient with the highest probability of being used). Then 2^10, 2^6, 2^14 and 2^2. All the remaining orderings introduce two new 2^0 ratios.
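
        A greedy sketch of that ordering rule: repeatedly pick the remaining weight whose minimum log2-ratio to the already-chosen weights is largest. Ties here are broken by preferring the larger weight, so the order within ties differs slightly from the hand-worked list above.

        use strict; use warnings;
        use List::Util qw[ min ];

        my @pool   = 1 .. 16;    # exponents of the candidate weights 2**e
        my @chosen = ( 0 );      # start with 2**0

        while( @pool ) {
            my( $best_i, $best_score ) = ( 0, -1 );
            for my $i ( 0 .. $#pool ) {
                # smallest exponent difference (= log2 ratio) to any chosen weight
                my $score = min map { abs( $pool[ $i ] - $_ ) } @chosen;
                ( $best_i, $best_score ) = ( $i, $score )
                    if $score > $best_score
                    or $score == $best_score && $pool[ $i ] > $pool[ $best_i ];
            }
            push @chosen, splice( @pool, $best_i, 1 );
        }
        print "@chosen\n";    # 0 16 8 12 4 14 10 6 2 15 13 11 9 7 5 3 1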

        I hadn't been able to figure out why the shuffling helped you, either. Before Eily's post, I was going to lean down the road of suggesting comparing your monotonically increasing ramp of weights with a monotonically decreasing ramp of weights, and seeing whether collisions were affected. And then recommend trying a list of weights which was half ramp + half random (actually, three sets of half-ramp: #1: first half increasing, #2: middle half increasing (a quarter random on each side), #3: final half increasing), and comparing all of those to one of your fully-shuffled weight sets. If you saw that the down-ramp and the half-ramps all had worse collisions than your shuffle, then I would suggest trying a mutator that tries to get rid of rampy segments: if you found a sequence of weights that were strictly increasing (or generally increasing, with occasional excursions), I would suggest having the mutator pick new random primes or just re-shuffle the rampy section.

        Also, I have found that histograms can hide some critical information. With your ordered set of 736000 combinations from the first example, you might want to compare the time-series of the generated signatures -- probably with the multiple weight lists: ramp up, ramp down, the three half-ramps, and one or two completely random weight lists -- to see if any patterns jump out and give you a hint for what your mutator could do.

        Finally, as a last thought: as Eily said, sums of primes don't approach uniqueness. However, products of primes do; a product of distinct primes is unique (unique factorisation). I spent some time while I couldn't sleep last night trying to come up with a reasonable way to use that, since you'd obviously quickly overrun your 64-bit integer with a product of K=1000 primes. :-) I think I came up with something (after looking up a few primey facts today) that would be a product of two primes; a rough sketch follows the list below.

        • To fit in 64bits, each prime would have to be less than 2**32. The biggest 32bit prime number (1) is 4_294_967_291, which is the 203_280_221st (2) prime.
        • That number of multiplyable primes is bigger than your expected N ≥ 200_000, so in theory we could assign two primes (or 1000) to each of your N things, and multiply two primes together.
        • If you didn't want to just start doling out primes randomly, you could do something like
          • @prime[1 .. 200_000] = import_from_list_of_first_million_primes()
          • generate two 1000-weight lists @wa and @wb, using your randomly shuffled primes
          • for each $comb, generate two sigs $sa and $sb in your sig function
            • even if one of your sigs had a collision, the chances that both collide at the same time are significantly smaller
            • return $product = $prime[$sa] * $prime[$sb]
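
        A rough sketch of that idea. How $sa and $sb are kept within the range of the prime table isn't specified above, so this sketch simply reduces them modulo the table size; the element codes and weight-list handling are illustrative assumptions.

        use strict; use warnings;
        use List::Util qw[ shuffle ];
        use Math::Prime::Util qw[ nth_prime ];

        my $TABLE = 200_000;
        my @prime = map nth_prime( $_ ), 1 .. $TABLE;    # prime lookup table

        my @wa = shuffle @prime[ 0 .. 999 ];             # two independent
        my @wb = shuffle @prime[ 0 .. 999 ];             # 1000-element weight lists

        sub two_prime_sig {
            my @elems = @_;                              # numeric codes of the subset's items
            my( $sa, $sb ) = ( 0, 0 );
            for my $i ( 0 .. $#elems ) {
                $sa += $elems[ $i ] * $wa[ $i ];
                $sb += $elems[ $i ] * $wb[ $i ];
            }
            # map each sum onto a prime and multiply; the product of two
            # primes from this table still fits comfortably in 64 bits
            return $prime[ $sa % $TABLE ] * $prime[ $sb % $TABLE ];
        }
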
Re: An optimal solution to finding weights for a weighted sum signature.
by Anonymous Monk on Nov 04, 2016 at 16:49 UTC

    Is there some reason you don't want to use an existing algorithm like CRC or MD5?

    In order to reduce the collision rate, it will help if you spread out the signature values over a larger output set. To keep from overflowing, you can choose a big modulus $M.

    my $x = 0;
    $x = ( $x + $P[$_] * $str[$_] ) % $M for 0 .. $#str;

    It's common to choose $M to be a power of two, so "% $M" becomes "& ($M - 1)". Choosing $M prime has some mathematical advantages, but as long as none of the @P values has any common factors with $M it works ok. If you set $P[$_] to $K**$_ for some $K, you won't even have to store @P.
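
    Putting those pieces together, a minimal sketch (Horner-style, so the $K**$_ weights are applied implicitly, in reverse position order; the values of $K and $M here are only illustrative):

    use strict; use warnings;

    sub mod_sig {
        my( $str, $K, $M ) = @_;
        my $x = 0;
        # equivalent to summing weighted char codes, reduced mod $M as we go
        $x = ( $x * $K + $_ ) % $M for unpack 'C*', $str;
        return $x;
    }

    print mod_sig( 'oqqcfuwfco', 131, 4_294_967_291 ), "\n";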

      Is there some reason you don't want to use an existing algorithm like CRC or MD5?

      Yes. Whilst in my examples I'm using strings of letters for simplicity and conciseness, the actual collection could be pretty much anything. E.g. instead of letters in a word, it might be the existence and ordering of words in a sentence, or the existence of phrases, regardless of ordering, in a document; or literally any other combinatorial subset of attributes of some dataset.

      In order to use CRC or MD5 for these other kinds of subset collections, it would require two passes: one to encode (say, via a hash-lookup mapping) the collection subset into a 'string' digestible by those algorithms; and a second pass over that string to arrive at the signature.

      If I can accumulate the signature whilst iterating the first pass, I can save a substantial amount of time. In a previous attempt at a similar problem I came up with an algorithm -- very specific to the data being processed -- that proved to be almost 50 times faster than first stringifying then MD5. With a dataset measured in terabytes, that was a substantial saving.

      To keep from overflowing, you can choose a big modulus $M.

      That's a good suggestion, but there is a problem with it. With a pure (non-modulo) arithmetic signature, the values tend to group like-near-like in a nearly continuous ordering. It isn't a collating order that anyone would recognise, but similar subsets tend to collate close to each other and transition gradually from one end of the spectrum to the other.

      However, once you start wrapping the signatures back on themselves using modulo arithmetic, that useful artifact of the non-modulo arithmetic disappears. It's not always needed or useful, but if I can avoid it, it might allow a more generic solution with additional useful properties.

      It is certainly something for me to keep in mind if another solution doesn't present itself. Thank you.



        Actually, both CRC and MD5 can be calculated incrementally without having to stringify the entire dataset at once. That's what the MD5 context object is for. I don't know of a CRC module that provides that functionality, but it's easy to implement.
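
        For example, with Digest::MD5's OO interface you can feed the items of a subset one at a time (@subset here is just placeholder data):

        use strict; use warnings;
        use Digest::MD5;

        my @subset = qw( item1 item2 item3 );    # placeholder encoded items
        my $ctx = Digest::MD5->new;
        $ctx->add( $_ ) for @subset;             # accumulate incrementally
        my $sig = $ctx->hexdigest;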

        But yeah, coming up with a function that has the "nonrandom" behavior you're looking for is going to be very tricky and data-dependent.