BrowserUk has asked for the wisdom of the Perl Monks concerning the following question:

A simple mechanism for producing a signature from a subset of a large collection (eg. strings of characters) is a weighted sum calculation. Ie. assign each character a numeric value; multiply each character in the strings value by a weight, say its position in the string, and sum the resultant products:

sub sig{ my $sig = 0; my $n = 0; $sig += ( $_ - 96 ) * ++$n for unpack 'C*', $_[0]; return $sig; };; print $string = join '', map chr( 97+rand( 26) ), 1 .. 10;; oqqcfuwfco print $sig = sig( $string );; 654 print $string = join '', map chr( 97+rand( 26) ), 1 .. 10;; qwfiedngda print $sig = sig( $string );; 366

That's okay as far as it goes, but it produces a bell curve with lots of duplicate signatures:

use Algorithm::Combinatorics qw[ combinations_with_repetition ];; $i = combinations_with_repetition( [ 'a'..'z' ], 6 );; undef %stats; ++$stats{ sig( join'', @$comb ) } while defined( $comb = $i->next ); printf "Ave. collisions:%.2f Max. collisions:%.2f\n", 736281 / keys(%s +tats), max( values %stats );; Ave. collisions:1438.05 Max. collisions:4061.00 pp\%stats;;

Abbreviated output:

pp \%stats;; { 21 => 1, 27 => 1, 32 => 1, 33 => 1, 36 => 1, 38 => 1, 39 => 2, 41 => 1, 42 => 2, 43 => 1, 44 => 1, 45 => 2, 47 => 2, 48 => 2, 49 => 1, 50 => 2, 51 => 3, 52 => 1, 53 => 3, 54 => 4, 55 => 1, 56 => 3, 57 => 5, 58 => 2, 59 => 4, 60 => 5, 61 => 3, 62 => 5, 63 => 7, 64 => 3, 65 => 6, 66 => 6, 67 => 4, 68 => 7, 69 => 9, 70 => 4, 71 => 8, 72 => 10, 73 => 6, 74 => 10, ... 302 => 3024, 303 => 3065, 304 => 2978, 305 => 3110, 306 => 3140, 307 => 3060, 308 => 3188, 309 => 3226, 310 => 3136, 311 => 3267, 312 => 3300, 313 => 3217, 314 => 3344, 315 => 3379, 316 => 3286, 317 => 3419, 318 => 3449, 319 => 3363, 320 => 3489, 321 => 3522, 322 => 3427, 323 => 3561, 324 => 3588, 325 => 3497, 326 => 3621, 327 => 3654, 328 => 3557, 329 => 3687, 330 => 3712, 331 => 3616, 332 => 3741, 333 => 3772, 334 => 3671, 335 => 3798, 336 => 3818, 337 => 3722, 338 => 3845, 339 => 3872, 340 => 3766, 341 => 3890, 342 => 3910, 343 => 3811, 344 => 3930, 345 => 3951, 346 => 3841, 347 => 3965, 348 => 3982, 349 => 3878, 350 => 3991, 351 => 4009, 352 => 3899, 353 => 4019, 354 => 4030, 355 => 3921, 356 => 4031, 357 => 4047, 358 => 3934, 359 => 4047, 360 => 4053, 361 => 3942, 362 => 4049, 363 => 4061, 364 => 3943, 365 => 4050, 366 => 4052, 367 => 3940, 368 => 4042, 369 => 4047, 370 => 3926, 371 => 4027, 372 => 4027, 373 => 3912, 374 => 4006, 375 => 4007, 376 => 3883, 377 => 3980, 378 => 3975, 379 => 3857, 380 => 3944, 381 => 3939, 382 => 3816, 383 => 3906, 384 => 3896, 385 => 3774, 386 => 3854, 387 => 3847, 388 => 3722, 389 => 3805, 390 => 3788, 391 => 3665, 392 => 3740, 393 => 3729, 394 => 3602, 395 => 3675, 396 => 3655, 397 => 3533, 398 => 3601, 399 => 3584, 400 => 3456, 401 => 3521, 402 => 3499, 403 => 3378, 404 => 3437, 405 => 3414, 406 => 3286, 407 => 3347, 408 => 3320, 409 => 3199, 410 => 3249, 411 => 3223, 412 => 3099, 413 => 3151, 414 => 3121, ... 534 => 10, 535 => 7, 536 => 7, 537 => 6, 538 => 4, 539 => 4, 540 => 4, 541 => 2, 542 => 2, 543 => 2, 544 => 1, 545 => 1, 546 => 1, }

For my purposes, that collision ratio is far too high, so I set about looking for a way to reduce it. Obviously, I could calculate the positional power of the maximum number of characters and generate a guaranteed unique number, but the number of unique items in my collection is ~2e5, the subsets can be hundreds of elements each, so the signature values would rapidly become so huge as to be unmanageable. Indeed, when stringified longer than the subsets they are meant precise.

My first though was to replace the monotonic sequential weights with the first N prime numbers:

and that certainly reduces the collisions markedly:

C:\test>junk52 -K=6 -M=1 -S=0 2 3 5 7 11 13 17 19 23 29 31 37 41 43 47 53 59 61 67 71 73 79 83 89 97 + 101 736000 Ave. collisions: 586.678088; Max. collisions:2567.000000

Then I though that instead of using sequential primes, it might help if I chose every other prime, or more wildly spaced, and that worked very well. Each doubling of the spacing between prime weights pretty much halving both the average and maximum number of collisions:

C:\test>junk52 -K=6 -M=1 -S=0 2 3 5 7 11 13 17 19 23 29 31 37 41 43 47 53 59 61 67 71 73 79 83 89 97 + 101 736000 Ave. collisions: 586.678088; Max. collisions:2567.000000 C:\test>junk52 -K=6 -M=2 -S=0 3 7 13 19 29 37 43 53 61 71 79 89 101 107 113 131 139 151 163 173 181 +193 199 223 229 239 736000 Ave. collisions: 439.046512; Max. collisions:1400.000000 C:\test>junk52 -K=6 -M=4 -S=0 7 19 37 53 71 89 107 131 151 173 193 223 239 263 281 311 337 359 383 4 +09 433 457 479 503 541 569 736000 Ave. collisions: 198.298142; Max. collisions:699.000000 C:\test>junk52 -K=6 -M=8 -S=0 19 53 89 131 173 223 263 311 359 409 457 503 569 613 659 719 769 827 8 +81 941 997 1049 1097 1163 1223 1283 736000 Ave. collisions: 93.377425; Max. collisions:468.000000 C:\test>junk52 -K=6 -M=16 -S=0 53 131 223 311 409 503 613 719 827 941 1049 1163 1283 1423 1511 1619 1 +747 1877 2003 2129 2267 2377 2503 2657 2741 2861 736000 Ave. collisions: 44.996700; Max. collisions:227.000000

That's great, but there is a limit to how far I can take it, as the weights get bigger, the sum of products starts to exceed my 64-bit integers for ever smaller subsets. So I then started to think about what else I could do to reduce the collisions, and wonders if shuffling the spaced primes would have any benefit, and that alone again halved the collision rates:

C:\test>junk52 -K=6 -M=16 -S=1 2741 2503 4481 7457 3671 7121 3413 6563 827 3301 7561 4909 3011 2657 3 +11 5167 223 6379 1877 2377 1619 7691 6271 1283 6007 3541 736000 Ave. collisions: 19.635207; Max. collisions:119.000000

Of course, that is only one sample of the possible shuffles, and so I ran bunch of reproducible shuffles to see whether all shuffles would be equal; and if not, could I see any pattern to the ordering of those that produced the best results:

C:\test>for /l %s in (1,17,1000) do @junk52 -K=6 -M=16 -S=%s 2741 2503 4481 7457 3671 7121 3413 6563 827 3301 7561 4909 3011 2657 3 +11 5167 223 6379 1877 2377 1619 7691 6271 1283 6007 3541 736000 Ave. collisions: 19.635207; Max. collisions:119.000000 3671 3797 1511 4057 311 2003 1283 223 1423 7457 3301 5573 3541 7691 13 +1 1747 5167 6379 7561 1163 1049 6133 3923 4211 6271 4481 736000 Ave. collisions: 20.188676; Max. collisions:134.000000 941 719 7561 503 4909 6563 4621 7253 3413 1877 5443 1423 3541 409 1511 + 4337 1049 3301 2503 311 6007 6271 6833 3011 7457 5011 736000 Ave. collisions: 19.603839; Max. collisions:123.000000 5011 2741 3167 6007 6563 3413 6971 1747 7253 4057 2267 6271 409 3541 4 +337 7853 613 4751 5167 2657 5443 1283 6379 7457 223 5573 736000 Ave. collisions: 19.998397; Max. collisions:141.000000 941 503 6133 3671 7457 2129 613 2741 7561 1423 223 5849 2503 2267 3413 + 827 2861 3923 1511 719 2003 311 6379 7853 6701 131 736000 Ave. collisions: 18.564826; Max. collisions:121.000000 6701 7691 2267 6007 2003 503 4057 4337 1283 2503 1163 5849 6971 6271 6 +563 7853 1511 4909 3011 7253 7457 7561 3797 941 4621 311 736000 Ave. collisions: 19.263278; Max. collisions:146.000000 4211 4621 7121 1619 3301 719 7691 7561 827 1049 2861 2657 5011 6833 50 +3 6133 2267 311 2129 1283 2503 3167 4751 5849 941 409 736000 Ave. collisions: 18.665543; Max. collisions:106.000000 941 7457 2741 719 4057 1619 1747 4751 827 6271 3011 1049 4621 6379 265 +7 5573 6133 1877 3671 2267 7561 6971 6007 311 4909 613 736000 Ave. collisions: 18.402424; Max. collisions:109.000000 613 311 7561 7253 5309 7121 3011 4621 2861 4337 503 2741 3167 6833 367 +1 2377 131 1423 1163 1747 7853 6007 4751 3541 1511 6133 736000 Ave. collisions: 19.277904; Max. collisions:123.000000 2503 6271 3797 6701 2741 2129 2003 1619 3301 7853 613 2267 6007 3671 4 +621 6563 1423 6379 3167 3923 1747 4211 7253 2377 4057 4751 736000 Ave. collisions: 21.288411; Max. collisions:136.000000 4909 1423 2861 4211 7121 1511 6271 2741 1747 3923 719 5011 5309 4337 5 +849 5711 2003 7253 3541 6833 4481 3797 311 613 6701 1163 736000 Ave. collisions: 20.972484; Max. collisions:141.000000 4751 7253 5167 1511 5011 1747 5443 3167 4057 2741 3541 2129 53 1163 68 +33 6133 6563 5309 2503 1283 7457 3413 4337 1619 3923 4481 736000 Ave. collisions: 21.680830; Max. collisions:175.000000 2129 1747 3797 2267 7853 1049 4211 4481 2003 3301 3671 7253 4621 1511 +6833 5849 4751 2377 3541 7691 5011 6007 1163 53 7121 1619 736000 Ave. collisions: 21.578529; Max. collisions:153.000000 4337 53 4057 5573 6563 3413 4751 223 2267 3011 1049 6271 6007 2503 530 +9 3797 1423 1163 2861 3167 6971 3541 409 6379 6133 1747 736000 Ave. collisions: 19.616374; Max. collisions:133.000000 2377 3671 7691 311 2267 1747 1283 3797 1877 4211 1423 1619 5167 7561 2 +741 1163 6833 719 6007 6271 1511 3167 7253 5573 3301 2003 736000 Ave. collisions: 23.313311; Max. collisions:186.000000 1619 2503 5011 5309 3011 6007 503 2377 719 4909 2129 2003 1163 3413 71 +21 6971 3541 4751 3301 6379 5167 941 5711 311 6833 2267 736000 Ave. collisions: 20.357250; Max. collisions:137.000000 3923 4057 7561 53 2267 6971 6563 4751 5711 7853 3413 2129 719 5573 421 +1 2657 3011 1877 7121 5443 2861 2377 503 1283 4909 2503 736000 Ave. collisions: 20.486394; Max. collisions:135.000000 1049 827 2129 7121 1619 1877 53 3671 6701 719 1511 3413 311 4057 4211 +6833 2503 2003 7253 5011 5849 5167 3011 6971 1163 2657 736000 Ave. collisions: 19.818072; Max. collisions:128.000000 503 2377 6971 4909 1163 2267 5167 3413 719 4621 409 6833 3301 1283 571 +1 6133 7121 941 311 4211 3167 223 4751 5849 4057 2861 736000 Ave. collisions: 20.875560; Max. collisions:119.000000 2741 3301 6563 6007 1283 3797 5167 2129 2503 4211 5011 4751 3011 941 6 +133 7561 3923 4621 1163 7121 1423 3167 613 1619 1747 4909 736000 Ave. collisions: 22.637387; Max. collisions:154.000000 2129 6701 3011 1877 3671 7121 5711 4909 1049 3797 1511 4481 1619 6133 +6833 1423 5573 5443 131 2861 827 7691 7253 5849 4057 7853 736000 Ave. collisions: 20.488104; Max. collisions:120.000000 3797 4057 4481 827 4211 503 5711 7561 6271 6833 719 4751 1423 131 2267 + 4337 1163 5011 1049 2503 2861 2003 7457 941 2377 3011 736000 Ave. collisions: 20.286576; Max. collisions:150.000000 409 6563 4909 1283 6701 53 5309 7121 4057 4751 5573 311 1049 6971 4621 + 7691 3923 3167 2741 4481 2267 3797 719 6271 2657 613 736000 Ave. collisions: 19.802614; Max. collisions:133.000000 3413 1511 6271 1049 6833 2741 4751 4057 4337 7853 6007 1423 1619 3541 +2503 4909 2657 3797 2861 2003 1747 3011 53 311 131 5011 736000 Ave. collisions: 21.646410; Max. collisions:175.000000 409 3923 4909 5309 131 4337 7457 7121 613 719 2129 6563 3797 6133 4621 + 5849 1619 4057 7561 6271 1163 2657 3301 1283 5443 941 736000 Ave. collisions: 19.018960; Max. collisions:122.000000 4481 6007 2503 4337 1511 5309 1049 5711 2003 2129 6833 7121 5573 3167 +1747 2267 2861 1877 3541 311 3923 3671 827 6271 5849 3301 736000 Ave. collisions: 23.165146; Max. collisions:182.000000 1163 53 1877 4751 1283 311 409 4211 6133 6379 6833 3167 1511 6701 5573 + 6971 3541 7121 3797 2377 7853 1619 131 613 827 7691 736000 Ave. collisions: 17.260901; Max. collisions:111.000000 2657 827 4211 7253 3011 5309 4337 6271 1423 2003 1747 3541 409 6007 36 +71 5443 7853 941 5711 5849 6563 3797 3301 7561 7121 1619 736000 Ave. collisions: 19.355442; Max. collisions:111.000000 3923 2503 6379 719 5573 827 613 1049 3301 1163 7853 1283 6133 6007 409 + 6563 6833 4481 7121 2129 1511 2003 3011 6701 4909 1747 736000 Ave. collisions: 19.884976; Max. collisions:140.000000 2741 719 5711 7561 2657 6379 6563 1877 503 613 2129 4211 7691 1283 448 +1 4621 5011 3413 7853 1163 7121 4337 223 6833 4057 5573 736000 Ave. collisions: 18.342825; Max. collisions:119.000000 941 4481 6563 5167 7691 1877 6007 1163 2861 223 3541 4621 5711 4751 16 +19 3671 4211 1511 1283 5849 503 2377 2657 2741 6701 2267 736000 Ave. collisions: 22.146454; Max. collisions:146.000000 4751 613 409 53 1283 4057 719 7561 6563 7691 1511 5309 6007 7253 6133 +941 6379 3923 223 1747 131 6971 1163 7853 6701 5167 736000 Ave. collisions: 16.439247; Max. collisions:108.000000 2129 1423 5711 7691 409 3413 719 3301 503 2741 2267 2503 3011 4751 316 +7 3797 7561 613 4337 827 5167 223 7253 6133 2861 5309 736000 Ave. collisions: 19.102351; Max. collisions:111.000000 6971 6271 409 3797 131 6007 5573 2003 503 6379 6833 613 827 5849 1619 +6701 941 7561 4057 2741 1423 1511 3413 5309 4481 5443 736000 Ave. collisions: 18.401505; Max. collisions:108.000000 5711 2741 1423 2861 7253 5309 6133 2657 6007 1877 2267 1049 409 4211 7 +19 223 4909 5849 7691 3167 827 3923 1283 3011 3671 131 736000 Ave. collisions: 21.309360; Max. collisions:129.000000 4337 2003 1423 2657 827 719 3797 2129 5711 7121 941 6133 53 2267 2741 +3301 7253 3413 3671 6833 7853 5309 409 3011 6563 5573 736000 Ave. collisions: 21.534351; Max. collisions:147.000000 7853 6563 4211 2861 131 3413 6833 5309 3797 3671 1049 3167 4337 1747 1 +619 223 3541 7121 719 1877 5167 2741 5849 4621 3923 409 736000 Ave. collisions: 20.528090; Max. collisions:135.000000 3301 7853 3797 4057 2003 719 6271 2503 6379 5443 6133 827 6971 7457 53 + 3167 2129 3011 1747 2657 1619 2377 4751 941 1049 7121 736000 Ave. collisions: 19.801549; Max. collisions:115.000000 1619 3923 7253 2741 941 719 3671 5573 5443 2861 2657 1877 3413 4211 51 +67 311 7853 53 6971 5011 2377 6563 613 2129 1283 6133 736000 Ave. collisions: 21.071519; Max. collisions:141.000000 5443 7253 1283 3541 503 131 2267 5309 2861 2657 2129 311 4481 5011 421 +1 1163 53 6563 7121 1423 5167 4057 3923 5711 5849 3413 736000 Ave. collisions: 21.947091; Max. collisions:145.000000 6007 53 5011 7121 2129 6701 2741 2267 1619 7253 4211 3413 719 1283 354 +1 2003 2657 1049 7853 6379 6271 5711 3671 1423 1877 1163 736000 Ave. collisions: 20.695421; Max. collisions:127.000000 4751 6563 223 941 2861 1619 1511 6833 3011 2377 7853 7561 5309 5849 34 +13 5573 5711 613 4211 1877 2003 3167 6271 4337 2267 1423 736000 Ave. collisions: 20.448842; Max. collisions:119.000000 4621 2657 4337 311 4057 7457 3167 4211 1423 5309 5167 2377 3413 2861 6 +379 1747 7853 2003 6007 6971 1511 4909 2741 5443 3301 1619 736000 Ave. collisions: 22.696702; Max. collisions:146.000000 2741 4211 2503 1511 1049 3541 4621 2377 4057 719 5443 5711 7691 3011 6 +007 1423 3923 2657 1163 3301 6701 827 1619 2861 53 6271 736000 Ave. collisions: 22.131143; Max. collisions:127.000000 3011 2129 6133 7691 6971 1747 2861 3413 4337 3923 5849 941 1619 1163 4 +211 3167 1877 3301 53 1511 503 2267 827 131 7253 2003 736000 Ave. collisions: 22.948541; Max. collisions:175.000000 3923 223 2377 5849 2741 5167 5309 2861 7853 5443 6701 1283 6133 3671 7 +561 1423 3797 6271 941 613 6971 6563 131 1049 503 4621 736000 Ave. collisions: 18.006383; Max. collisions:114.000000 827 4621 1283 4057 3541 7253 7561 613 1423 1511 941 2267 4909 2503 116 +3 3413 3301 5573 2657 6007 4751 6971 1747 4481 719 2377 736000 Ave. collisions: 20.982645; Max. collisions:120.000000 1163 7253 1619 3541 6007 223 613 2861 4211 1877 2129 2003 719 6563 745 +7 2377 1747 5573 6133 3167 7121 5011 2741 6379 5167 6701 736000 Ave. collisions: 19.644637; Max. collisions:114.000000 3923 2377 3671 6133 6563 1877 2267 2129 4481 4211 3541 2003 7561 941 7 +19 613 1163 1511 4621 2503 5309 409 7457 7691 1283 6379 736000 Ave. collisions: 19.671930; Max. collisions:131.000000 7561 3301 1747 1423 223 4481 941 53 1163 4751 6833 2267 5573 7253 7853 + 6563 4211 1049 7691 1619 311 2657 3413 2503 7457 131 736000 Ave. collisions: 19.086505; Max. collisions:147.000000 7691 3671 1283 131 1163 5011 5849 7853 3301 2003 3797 4621 4337 5167 3 +11 7121 2861 3011 1619 2377 2267 2657 1049 6701 6379 223 736000 Ave. collisions: 19.015522; Max. collisions:164.000000 6701 7561 719 613 3923 7853 4057 827 941 5711 2503 1049 3671 1283 7691 + 4621 7121 4909 5167 5573 3011 2267 311 1747 2129 3167 736000 Ave. collisions: 19.334603; Max. collisions:119.000000 1619 4621 3301 3541 6971 2861 53 1283 3671 7691 409 6833 4481 3413 670 +1 5011 2741 1747 5711 2267 2377 1163 5443 5167 2503 3923 736000 Ave. collisions: 21.216638; Max. collisions:128.000000 5573 3671 7121 223 5011 7457 941 719 2003 1511 1423 3541 6007 3797 187 +7 5167 4751 3923 2129 6271 7691 131 7253 1049 5849 827 736000 Ave. collisions: 18.597651; Max. collisions:123.000000 1283 7253 2503 3797 7691 6379 1747 7121 311 6563 4909 3301 613 5849 10 +49 4211 4621 827 1163 1619 5443 1511 6133 5309 6833 5167 736000 Ave. collisions: 18.933860; Max. collisions:124.000000 6007 5167 4481 4909 6271 5309 3797 311 1283 1423 7121 7691 223 1049 42 +11 53 5573 2377 131 827 2003 2657 6379 5011 5443 1163 736000 Ave. collisions: 20.254766; Max. collisions:124.000000 4211 7121 6833 3413 3923 2267 5011 1619 5849 131 5443 3167 2503 7691 4 +337 941 2003 2129 4481 53 6971 1877 1283 2861 6563 4751 736000 Ave. collisions: 22.598478; Max. collisions:154.000000 409 5711 1747 5167 53 1423 1619 2267 223 3301 6563 3541 6271 4621 3923 + 131 5573 7691 6833 4481 2741 719 6007 2503 4909 3797 736000 Ave. collisions: 19.441816; Max. collisions:124.000000 1877 3671 7853 3797 2003 1283 2657 7457 4621 5849 3167 2741 3413 1423 +6833 3923 5167 6133 827 2503 409 3011 2377 3301 5711 1747 736000 Ave. collisions: 21.433425; Max. collisions:132.000000

I've done a bunch of runs like the one above and the results are remarkably consistent, with the average number of collisions rarely going below 18/sig or above 22/sig. On the face of things, seemingly, any shuffle is good enough to reduce the collision rate by a half relative to monotonically increasing spaced primes.

But therein lies my quandary, monotonically increasing values is just another of the possible shuffles; so why would every other shuffle produce so much better results? Then I looked more closely at the results I was getting (above and other runs) and in most runs there are one or two that exceed 23/sig average, and usually at most one that get below 18. In the case of the run above this results set stands out:

4751 613 409 53 1283 4057 719 7561 6563 7691 1511 5309 6007 7253 6133 +941 6379 3923 223 1747 131 6971 1163 7853 6701 5167 736000 Ave. collisions: 16.439247; Max. collisions:108.000000

With the average collisions >1.5/sig lower than the next best; but there does not seem to be anything special about the ordering of the primes. (Note:The primes used are the same in all cases, only their ordering varies.)

So now I'm thinking that there may be an "optimal ordering" for any given set of prime multipliers, and I'm considering ways that I might discover that ordering.

The obvious programming approach to this goal would be to implement a GA for the problem. The scoring function is trivial, if not especially quick, and the selection criteria could be the average or maximum collision rates, or some weighted combination of the two. I favour the former.

But where I'm having trouble is trying to decide a good mutation strategy? Looking at the results sets above (and the others I've generated), there seems to be a great preponderance of average results, and very few 'good' ones; so I'm think that any kind of mass reordering of the good sets, as a mutation strategy, is simply likely to produce more average sets?

So, I'm thinking rather than mutating the top 10 or twenty results in some large way, I should produce the next generation by swapping several, single pairs of values in each of the (say) top three best results from the previous generation?

Does anyone have a better suggestion for a mutation strategy?

And finally (sorry this got so long), do any of the math-heads out there have any suggestions for a better -- perhaps mathematical -- method of choosing and/or ordering of the low prime weights? Or indeed, and other set of numerical weights that won't exceed my 64-bit integers when sum-producted?

Thanks for your time and considerations; any thoughts and suggestions are welcomed.


With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority". The enemy of (IT) success is complexity.
In the absence of evidence, opinion is indistinguishable from prejudice.

Replies are listed 'Best First'.
Re: An optimal solution to finding weights for a weighted sum signature.
by Eily (Monsignor) on Nov 04, 2016 at 17:52 UTC

    It's not about primes, it's about big numbers, with distributions 1, 1000, 10000000, the first char will contribute to the LSBs, the second char to more significant bits and so on. The higher the ratio between those numbers, the more distance between the bit your chars can change. You see collisions when there is an overlap. For example with 26 values and weights of 1, 8, and 64, each char will directly affect a range of 8 bits, and each range will be at a distance of 3 bits from the next, with an overlap of 5. But with weights 1, 32 and 1024, you'll get no overlap in the range of bits each char can affect (so no carry).

    Actually the perfect list of weights in your example with chars a..z is 1, 26, 26^2, 26^3 .... But this will not fit into your integers as soon as 26^L > 2^64 with L the maximum length of your string.

    The general best distribution (no collisions, ignoring the limitation of 64 bits) is the list of K^N with K the number of possible characters. And I'm pretty sure the list of K^(N/C) should give you an average collision close to C (C strings for each value), but would require an integer with C times less bits.

      Actually the perfect list of weights in your example with chars a..z is 1, 26, 26^2, 26^3 .... But this will not fit into your integers as soon as 26^L > 2^64 with L the maximum length of your string.

      Yes. That is what I was attempting to describe in the OP when I said: "I could calculate the positional power of the maximum number of characters and generate a guaranteed unique number, but the number of unique items in my collection is ~2e5, the subsets can be hundreds of elements each, so the signature values would rapidly become so huge as to be unmanageable.".

      Using this schema on simple character data with an alphabet of just 26 chars means I would run over the limit of my 64-bit ints after only 13 characters.

      But as I said above, in a typical example of the type of problem I am trying to cater for, the 'alphabet' could be as large as 200,000 or even larger; and size of my subsets (aka. length of the strings), is frequently multiple hundreds or even low thousands. As such, there can be no "perfect set of weights".

      However, those numbers do hide some good news. With numbers this large, the number of combinations is ludicrously large, so even if I need to signature hundreds of billions of them, I will still only be processing a tiny, miniscule percentage of the total, so the likelihood that I would encounter a collision without it being an exact duplicate -- assuming even distributions -- is very slim indeed.

      To try and put some flesh on that assertion. Let's say my 'alphabet' is 200,000 (N) unique things; and my subsets each consist of some combination of 1000 (K) of those things, then the formula for the total number of combinations is (N+K-1)! / (K! * (N-K)!), which for those example numbers comes out at: 1.414e8029

      So even if I need to process one trillion of those, that's still only 0.00000000...{8000 zeros}...00000007%. False collisions are unlikely.

      Of course, totally even distributions in real world data are few and far between, but still it means that using a less than perfect signature mechanism is very likely to detect actual duplicates and rarely detect false ones.

      It's not about primes,

      Perhaps not mathematically, but practically it is. In part because I'm making it so. It is easy to demonstrate (with small subsets) that if I used powers of 2 weights, the incidence of collisions is vastly reduced:

      C:\test>weightedPrimeSumProduct -K=6 -M=16 1 2 4 8 16 32 64 128 256 512 1024 2048 4096 8192 16384 32768 65536 131 +072 262144 524288 1048576 2097152 4194304 8388608 16777216 33554432 736000 Ave. collisions: 7.091558; Max. collisions:79.000000

      But that limits me to 'alphabets' of 63 chars at most; and practically to far less once sum of products plays out.

      I might consider trying to stretch the alphabet by assigning each power of two to two or more items in my collection, but that causes a vary rapid climb in the false positives:

      C:\test>weightedPrimeSumProduct -K=6 -M=16 1 1 2 2 4 4 8 8 16 16 32 32 64 64 128 128 256 256 512 512 1024 1024 20 +48 2048 4096 4096 736000 Ave. collisions: 81.936457; Max. collisions:671.000000 C:\test>weightedPrimeSumProduct -K=6 -M=16 1 1 1 2 2 2 4 4 4 8 8 8 16 16 16 32 32 32 64 64 64 128 128 128 256 256 736000 Ave. collisions: 414.806197; Max. collisions:2931.000000

      Almost back to monotonic sequential weightings. And this is where the idea of using primes to expand the alphabet size that could be handled whilst reducing the chances of arithmetic collisions. The sum of multiples of unique primes is less likely to collide. And by using every second, third or fourth prime, the collisions reduce further. There is no real surprise in any of that.

      What did surprise me was the effect that shuffling the primes had on reducing the collision rates. And that bring me back to my questions in the OP regarding how to determine the best ordering of the chosen set of primes.

      I am quite convinced that a GA is the only game in town for arriving at a good, if not necessarily optimal ordering; but I'm still trying to wrap my head around a suitable mutation algorithm to allow the GA to arrive at a good solution within a reasonable time frame.


      With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
      Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
      "Science is about questioning the status quo. Questioning authority". The enemy of (IT) success is complexity.
      In the absence of evidence, opinion is indistinguishable from prejudice.
      i

        With ~2e5, you'll have to do something beside the simple weighted sum (like a modulus) if you want to avoid the bell shape, which tends to be a thing with the sum of independant variables (with identical probability distribution - not your case with the weights - you'd actually get close to a gaussian distribution, a perfect bell)

        The sum of multiples of unique primes is less likely to collide.
        That's not a property of primes I know. The reason primes gives you better result is because their progression is close enough to an exponential one.

        What did surprise me was the effect that shuffling the primes had on reducing the collision rates.
        I actually believed this couldn't be true until I realized you are not working with fixed length input. If you were, you'd just have to apply the same permutation to the string as the one applied to the primes to obtain the same result:
        a b c d 3 7 17 29
        Is the same as:
        b d a c 7 29 3 17
        and so on. But with variable length input, the coefficients on the left are more likely to be used. So you obtain better result by having coefficients that are more distant in the front and the "less efficient" ones at the end. You can start by 1, and then add your coefficient by maximazing the ratio (the distance between the bits two characters will affect is log(K1)-log(K2), so an increasing function of (K1/K2)). Taking the powers of two up to 2^16 that could be: 2^0, 2^16, 2^8, 2^12, 2^4, 2^10, 2^6, 2^14, 2^2, 2^15, 2^1, 2^13, 2^3, 2^11, 2^5, 2^9, 2^7 The biggest ratio you can get with those numbers is of course 2^16. Then you take number that's halfway (in terms of ratio) between: 2^16. 2^12 and 2^4 both have a minimum ratio with one of the previous number of 2^4 (I put 2^12 first because it's the further away from 2^0, which is the coefficient with the highest probability to be used). Then 2^10, 2^6, 2^14 and 2^2. All the remaining order introduce two new 2^0 ratios.

        I hadn't been able to figure out why the shuffling helped you, either. Before Eily's post, I was going to lean down the road of suggesting comparing your monotonically increasing ramp of weights with a monotonically decreasing ramp of weights, and see whether collisions were affected. And then recommend trying a list of weights which was half ramp + half random (actually, three sets of half-ramp: #1:first half increasing, #2:middle half increasing (quarter random on each side), #3: final half increasing. (And compare all those to one of your fully-shuffled weight setes.) If you saw that the down-ramp and the the half-ramps all had worse collisions than your shuffle, then I would suggest trying a mutator that tried to get rid of rampy-segments: if you found a sequence of weights that were strictly increasing (or generally increasing, with occasional excursions), I would suggest having the mutator pick new random primes or just re-shuffle the rampy section.

        Also, I have found that histograms can hide some critical information. With your ordered set of 736000 combinations from the first example, you might want to comapre the time-series of the generated signatures -- probably with the multiple weight lists: ramp up, ramp down, 3 half ramps, and one or two completely random weights, to see if any patterns jump out and give you a hint for what your mutator could do.

        Finally, as a last thought: as Eily said, sums of primes don't approach uniqueness. However, products of primes do. I know that purely multiplying primes is unique. I spent some time while I couldn't sleep last night trying to come up with a reasonable way to use that, since you'd obviously quickly overrun your 64bit integer with a product of K=1000 primes. :-) I think I came up with something (after looking up a few primey facts today) that would be a product of two primes.

        • To fit in 64bits, each prime would have to be less than 2**32. The biggest 32bit prime number (1) is 4_294_967_291, which is the 203_280_221st (2) prime.
        • That number of multiplyable-primes is bigger than your expected N≥200_000, so in theory, we could assign two primes (or 1000) to each of your N things, and multiple two primes together.
        • If you didn't want to just start doling out primes randomly, you could do something like
          • @prime[1 .. 200_000] = import_from_list_of_first_million_primes()
          • generate two 1000-weight lists @wa and @wb, using your randomly shuffled primes
          • for each $comb, generate two sigs $sa and $sb in your sig function
            • even if one of your sigs had a collsion, the chances that both collide at the same time are significanlty smaller
            • return $product = $prime[$sa] * $prime[$sb]
Re: An optimal solution to finding weights for a weighted sum signature.
by Anonymous Monk on Nov 04, 2016 at 16:49 UTC

    Is there some reason you don't want to use an existing algorithm like CRC or MD5?

    In order to reduce the collision rate, it will help if you spread out the signature values over a larger output set. To keep from overflowing, you can choose a big modulus $M.

    $x = ($x + $P[$_] * $str[$_]) % $M for 0 .. $#str

    It's common to choose $M to be a power of two, so "% $M" becomes "& ($M - 1)". Choosing $M prime has some mathematical advantages, but as long as none of the @P values has any common factors with $M it works ok. If you set $P[$_] to $K**$_ for some $K, you won't even have to store @P.

      Is there some reason you don't want to use an existing algorithm like CRC or MD5?

      Yes. Whilst in my examples I'm using strings of letters for simplicity and conciseness; the actual collection could be pretty much anything. Eg. Instead of letters in a word; in might be the existence of and ordering of words in a sentence, or the existence of phrases, regardless of ordering in a document; or literally any other combinatorial subset of attributes of some datasets.

      In order to use crc or MD5 for these other kinds of subset collections, it would require two passes: one to encode (say hash lookup mapping) the collection subset into a 'string' digestible by those algorithms; and a second pass over that string to arrive at the signature.

      If I can accumulate the signature whilst iterating the first pass, I can save a substantial amount of time. In a previous attempt at a similar problem I came up with an algorithm -- very specific to the data being processed -- that proved to be almost 50 times faster than first stringifying then MD5. With a dataset measured in terabytes, that was a substantial saving.

      To keep from overflowing, you can choose a big modulus $M.

      That's a good suggestion, but there is a problem with it. With a pure (non-modulo) arithmetic signature, the values tend to group like-near-like in a nearly continuous ordering. It isn't a collating order that anyone would recognise, but similar subsets tend to be collate close to each other and transition gradually from one end of the spectrum to the other.

      However, once you start wrapping the signatures back on themselves using modulo arithmetic, that useful artifact of the non-modulo arithmetic disappears. It's not always needed or useful, but if I can avoid it, it might allow a more generic solution with additional useful properties.

      It is certainly something for me to keep in mind if another solution doesn't present itself. Thankyou.


      With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
      Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
      "Science is about questioning the status quo. Questioning authority". The enemy of (IT) success is complexity.
      In the absence of evidence, opinion is indistinguishable from prejudice.

        Actually, both CRC and MD5 can be calculated incrementally without having to stringify the entire dataset at once. That's what the MD5 context object is for. I don't know of a CRC module that provides that functionality, but it's easy to implement.

        But yeah, coming up with a function that has the "nonrandom" behavior you're looking for is going to be very tricky and data-dependent.