comment on

Actually the perfect list of weights in your example with chars a..z is 1, 26, 26^2, 26^3 .... But this will not fit into your integers as soon as 26^L > 2^64 with L the maximum length of your string.

Yes. That is what I was attempting to describe in the OP when I said: "I could calculate the positional power of the maximum number of characters and generate a guaranteed unique number, but the number of unique items in my collection is ~2e5, the subsets can be hundreds of elements each, so the signature values would rapidly become so huge as to be unmanageable.".

Using this schema on simple character data with an alphabet of just 26 chars means I would run over the limit of my 64-bit ints after only 13 characters.

But as I said above, in a typical example of the type of problem I am trying to cater for, the 'alphabet' could be as large as 200,000 or even larger; and size of my subsets (aka. length of the strings), is frequently multiple hundreds or even low thousands. As such, there can be no "perfect set of weights".

However, those numbers do hide some good news. With numbers this large, the number of combinations is ludicrously large, so even if I need to signature hundreds of billions of them, I will still only be processing a tiny, miniscule percentage of the total, so the likelihood that I would encounter a collision without it being an exact duplicate -- assuming even distributions -- is very slim indeed.

To try and put some flesh on that assertion. Let's say my 'alphabet' is 200,000 (N) unique things; and my subsets each consist of some combination of 1000 (K) of those things, then the formula for the total number of combinations is (N+K-1)! / (K! * (N-K)!), which for those example numbers comes out at: 1.414e8029

So even if I need to process one trillion of those, that's still only 0.00000000...{8000 zeros}...00000007%. False collisions are unlikely.

Of course, totally even distributions in real world data are few and far between, but still it means that using a less than perfect signature mechanism is very likely to detect actual duplicates and rarely detect false ones.

It's not about primes,

Perhaps not mathematically, but practically it is. In part because I'm making it so. It is easy to demonstrate (with small subsets) that if I used powers of 2 weights, the incidence of collisions is vastly reduced:

C:\test>weightedPrimeSumProduct -K=6 -M=16
1 2 4 8 16 32 64 128 256 512 1024 2048 4096 8192 16384 32768 65536 131
+072 262144 524288 1048576 2097152 4194304 8388608 16777216 33554432
736000  Ave. collisions: 7.091558; Max. collisions:79.000000
[download]

But that limits me to 'alphabets' of 63 chars at most; and practically to far less once sum of products plays out.

I might consider trying to stretch the alphabet by assigning each power of two to two or more items in my collection, but that causes a vary rapid climb in the false positives:

C:\test>weightedPrimeSumProduct -K=6 -M=16
1 1 2 2 4 4 8 8 16 16 32 32 64 64 128 128 256 256 512 512 1024 1024 20
+48 2048 4096 4096
736000  Ave. collisions: 81.936457; Max. collisions:671.000000

C:\test>weightedPrimeSumProduct -K=6 -M=16
1 1 1 2 2 2 4 4 4 8 8 8 16 16 16 32 32 32 64 64 64 128 128 128 256 256
736000  Ave. collisions: 414.806197; Max. collisions:2931.000000
[download]

Almost back to monotonic sequential weightings. And this is where the idea of using primes to expand the alphabet size that could be handled whilst reducing the chances of arithmetic collisions. The sum of multiples of unique primes is less likely to collide. And by using every second, third or fourth prime, the collisions reduce further. There is no real surprise in any of that.

What did surprise me was the effect that shuffling the primes had on reducing the collision rates. And that bring me back to my questions in the OP regarding how to determine the best ordering of the chosen set of primes.

I am quite convinced that a GA is the only game in town for arriving at a good, if not necessarily optimal ordering; but I'm still trying to wrap my head around a suitable mutation algorithm to allow the GA to arrive at a good solution within a reasonable time frame.

With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'

Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.

"Science is about questioning the status quo. Questioning authority". The enemy of (IT) success is complexity.

In the absence of evidence, opinion is indistinguishable from prejudice.

In reply to Re^2: An optimal solution to finding weights for a weighted sum signature. by BrowserUk
in thread An optimal solution to finding weights for a weighted sum signature. by BrowserUk

Posts are HTML formatted. Put <p> </p> tags around your paragraphs. Put <code> </code> tags around your code and data!

Titles consisting of a single word are discouraged, and in most cases are disallowed outright.

Read Where should I post X? if you're not absolutely sure you're posting in the right place.

Please read these before you post! —

Posts may use any of the Perl Monks Approved HTML tags:

a, abbr, b, big, blockquote, br, caption, center, col, colgroup, dd, del, details, div, dl, dt, em, font, h1, h2, h3, h4, h5, h6, hr, i, ins, li, ol, p, pre, readmore, small, span, spoiler, strike, strong, sub, summary, sup, table, tbody, td, tfoot, th, thead, tr, tt, u, ul, wbr

You may need to use entities for some characters, as follows. (Exception: Within code tags, you can put the characters literally.)

	For:		Use:
	&		`&`
	<		`<`
	>		`>`
	[		`[`
	]		`]`

Link using PerlMonks shortcuts! What shortcuts can I use for linking?

See Writeup Formatting Tips and other pages linked from there for more info.