comment on

A simple mechanism for producing a signature from a subset of a large collection (eg. strings of characters) is a weighted sum calculation. Ie. assign each character a numeric value; multiply each character in the strings value by a weight, say its position in the string, and sum the resultant products:

sub sig{ 
    my $sig = 0; 
    my $n = 0; 
    $sig += ( $_ - 96 ) * ++$n for unpack 'C*', $_[0]; 
    return $sig; 
};;

print $string = join '', map chr( 97+rand( 26) ), 1 .. 10;;
oqqcfuwfco

print $sig = sig( $string );;
654

print $string = join '', map chr( 97+rand( 26) ), 1 .. 10;;
qwfiedngda

print $sig = sig( $string );;
366
[download]

That's okay as far as it goes, but it produces a bell curve with lots of duplicate signatures:

use Algorithm::Combinatorics qw[ combinations_with_repetition ];;

$i = combinations_with_repetition( [ 'a'..'z' ], 6 );;

undef %stats; 
++$stats{ sig( join'', @$comb ) } while defined( $comb = $i->next ); 

printf "Ave. collisions:%.2f Max. collisions:%.2f\n", 736281 / keys(%s
+tats), max( values %stats );;
Ave. collisions:1438.05 Max. collisions:4061.00

pp\%stats;;
[download]

Abbreviated output:

For my purposes, that collision ratio is far too high, so I set about looking for a way to reduce it. Obviously, I could calculate the positional power of the maximum number of characters and generate a guaranteed unique number, but the number of unique items in my collection is ~2e5, the subsets can be hundreds of elements each, so the signature values would rapidly become so huge as to be unmanageable. Indeed, when stringified longer than the subsets they are meant precise.

My first though was to replace the monotonic sequential weights with the first N prime numbers:

and that certainly reduces the collisions markedly:

C:\test>junk52 -K=6 -M=1 -S=0
2 3 5 7 11 13 17 19 23 29 31 37 41 43 47 53 59 61 67 71 73 79 83 89 97
+ 101

736000  Ave. collisions: 586.678088; Max. collisions:2567.000000
[download]

Then I though that instead of using sequential primes, it might help if I chose every other prime, or more wildly spaced, and that worked very well. Each doubling of the spacing between prime weights pretty much halving both the average and maximum number of collisions:

C:\test>junk52 -K=6 -M=1 -S=0
2 3 5 7 11 13 17 19 23 29 31 37 41 43 47 53 59 61 67 71 73 79 83 89 97
+ 101

736000  Ave. collisions: 586.678088; Max. collisions:2567.000000

C:\test>junk52 -K=6 -M=2 -S=0
3 7 13 19 29 37 43 53 61 71 79 89 101 107 113 131 139 151 163 173 181 
+193 199 223 229 239

736000  Ave. collisions: 439.046512; Max. collisions:1400.000000

C:\test>junk52 -K=6 -M=4 -S=0
7 19 37 53 71 89 107 131 151 173 193 223 239 263 281 311 337 359 383 4
+09 433 457 479 503 541 569

736000  Ave. collisions: 198.298142; Max. collisions:699.000000

C:\test>junk52 -K=6 -M=8 -S=0
19 53 89 131 173 223 263 311 359 409 457 503 569 613 659 719 769 827 8
+81 941 997 1049 1097 1163 1223 1283

736000  Ave. collisions: 93.377425; Max. collisions:468.000000

C:\test>junk52 -K=6 -M=16 -S=0
53 131 223 311 409 503 613 719 827 941 1049 1163 1283 1423 1511 1619 1
+747 1877 2003 2129 2267 2377 2503 2657 2741 2861

736000  Ave. collisions: 44.996700; Max. collisions:227.000000
[download]

That's great, but there is a limit to how far I can take it, as the weights get bigger, the sum of products starts to exceed my 64-bit integers for ever smaller subsets. So I then started to think about what else I could do to reduce the collisions, and wonders if shuffling the spaced primes would have any benefit, and that alone again halved the collision rates:

C:\test>junk52 -K=6 -M=16 -S=1
2741 2503 4481 7457 3671 7121 3413 6563 827 3301 7561 4909 3011 2657 3
+11 5167 223 6379 1877 2377 1619 7691 6271 1283 6007 3541

736000  Ave. collisions: 19.635207; Max. collisions:119.000000
[download]

Of course, that is only one sample of the possible shuffles, and so I ran bunch of reproducible shuffles to see whether all shuffles would be equal; and if not, could I see any pattern to the ordering of those that produced the best results:

I've done a bunch of runs like the one above and the results are remarkably consistent, with the average number of collisions rarely going below 18/sig or above 22/sig. On the face of things, seemingly, any shuffle is good enough to reduce the collision rate by a half relative to monotonically increasing spaced primes.

But therein lies my quandary, monotonically increasing values is just another of the possible shuffles; so why would every other shuffle produce so much better results? Then I looked more closely at the results I was getting (above and other runs) and in most runs there are one or two that exceed 23/sig average, and usually at most one that get below 18. In the case of the run above this results set stands out:

4751 613 409 53 1283 4057 719 7561 6563 7691 1511 5309 6007 7253 6133 
+941 6379 3923 223 1747 131 6971 1163 7853 6701 5167
736000  Ave. collisions: 16.439247; Max. collisions:108.000000
[download]

With the average collisions >1.5/sig lower than the next best; but there does not seem to be anything special about the ordering of the primes. (Note:The primes used are the same in all cases, only their ordering varies.)

So now I'm thinking that there may be an "optimal ordering" for any given set of prime multipliers, and I'm considering ways that I might discover that ordering.

The obvious programming approach to this goal would be to implement a GA for the problem. The scoring function is trivial, if not especially quick, and the selection criteria could be the average or maximum collision rates, or some weighted combination of the two. I favour the former.

But where I'm having trouble is trying to decide a good mutation strategy? Looking at the results sets above (and the others I've generated), there seems to be a great preponderance of average results, and very few 'good' ones; so I'm think that any kind of mass reordering of the good sets, as a mutation strategy, is simply likely to produce more average sets?

So, I'm thinking rather than mutating the top 10 or twenty results in some large way, I should produce the next generation by swapping several, single pairs of values in each of the (say) top three best results from the previous generation?

Does anyone have a better suggestion for a mutation strategy?

And finally (sorry this got so long), do any of the math-heads out there have any suggestions for a better -- perhaps mathematical -- method of choosing and/or ordering of the low prime weights? Or indeed, and other set of numerical weights that won't exceed my 64-bit integers when sum-producted?

Thanks for your time and considerations; any thoughts and suggestions are welcomed.

With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'

Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.

"Science is about questioning the status quo. Questioning authority". The enemy of (IT) success is complexity.

In the absence of evidence, opinion is indistinguishable from prejudice.

In reply to An optimal solution to finding weights for a weighted sum signature. by BrowserUk

Posts are HTML formatted. Put <p> </p> tags around your paragraphs. Put <code> </code> tags around your code and data!

Titles consisting of a single word are discouraged, and in most cases are disallowed outright.

Read Where should I post X? if you're not absolutely sure you're posting in the right place.

Please read these before you post! —

Posts may use any of the Perl Monks Approved HTML tags:

a, abbr, b, big, blockquote, br, caption, center, col, colgroup, dd, del, details, div, dl, dt, em, font, h1, h2, h3, h4, h5, h6, hr, i, ins, li, ol, p, pre, readmore, small, span, spoiler, strike, strong, sub, summary, sup, table, tbody, td, tfoot, th, thead, tr, tt, u, ul, wbr

You may need to use entities for some characters, as follows. (Exception: Within code tags, you can put the characters literally.)

	For:		Use:
	&		`&`
	<		`<`
	>		`>`
	[		`[`
	]		`]`

Link using PerlMonks shortcuts! What shortcuts can I use for linking?

See Writeup Formatting Tips and other pages linked from there for more info.