BrowserUk has asked for the wisdom of the Perl Monks concerning the following question:
A simple mechanism for producing a signature from a subset of a large collection (eg. strings of characters) is a weighted sum calculation. Ie. assign each character a numeric value; multiply each character in the strings value by a weight, say its position in the string, and sum the resultant products:
sub sig{ my $sig = 0; my $n = 0; $sig += ( $_ - 96 ) * ++$n for unpack 'C*', $_[0]; return $sig; };; print $string = join '', map chr( 97+rand( 26) ), 1 .. 10;; oqqcfuwfco print $sig = sig( $string );; 654 print $string = join '', map chr( 97+rand( 26) ), 1 .. 10;; qwfiedngda print $sig = sig( $string );; 366
That's okay as far as it goes, but it produces a bell curve with lots of duplicate signatures:
use Algorithm::Combinatorics qw[ combinations_with_repetition ];; $i = combinations_with_repetition( [ 'a'..'z' ], 6 );; undef %stats; ++$stats{ sig( join'', @$comb ) } while defined( $comb = $i->next ); printf "Ave. collisions:%.2f Max. collisions:%.2f\n", 736281 / keys(%s +tats), max( values %stats );; Ave. collisions:1438.05 Max. collisions:4061.00 pp\%stats;;
Abbreviated output:
pp \%stats;; { 21 => 1, 27 => 1, 32 => 1, 33 => 1, 36 => 1, 38 => 1, 39 => 2, 41 => 1, 42 => 2, 43 => 1, 44 => 1, 45 => 2, 47 => 2, 48 => 2, 49 => 1, 50 => 2, 51 => 3, 52 => 1, 53 => 3, 54 => 4, 55 => 1, 56 => 3, 57 => 5, 58 => 2, 59 => 4, 60 => 5, 61 => 3, 62 => 5, 63 => 7, 64 => 3, 65 => 6, 66 => 6, 67 => 4, 68 => 7, 69 => 9, 70 => 4, 71 => 8, 72 => 10, 73 => 6, 74 => 10, ... 302 => 3024, 303 => 3065, 304 => 2978, 305 => 3110, 306 => 3140, 307 => 3060, 308 => 3188, 309 => 3226, 310 => 3136, 311 => 3267, 312 => 3300, 313 => 3217, 314 => 3344, 315 => 3379, 316 => 3286, 317 => 3419, 318 => 3449, 319 => 3363, 320 => 3489, 321 => 3522, 322 => 3427, 323 => 3561, 324 => 3588, 325 => 3497, 326 => 3621, 327 => 3654, 328 => 3557, 329 => 3687, 330 => 3712, 331 => 3616, 332 => 3741, 333 => 3772, 334 => 3671, 335 => 3798, 336 => 3818, 337 => 3722, 338 => 3845, 339 => 3872, 340 => 3766, 341 => 3890, 342 => 3910, 343 => 3811, 344 => 3930, 345 => 3951, 346 => 3841, 347 => 3965, 348 => 3982, 349 => 3878, 350 => 3991, 351 => 4009, 352 => 3899, 353 => 4019, 354 => 4030, 355 => 3921, 356 => 4031, 357 => 4047, 358 => 3934, 359 => 4047, 360 => 4053, 361 => 3942, 362 => 4049, 363 => 4061, 364 => 3943, 365 => 4050, 366 => 4052, 367 => 3940, 368 => 4042, 369 => 4047, 370 => 3926, 371 => 4027, 372 => 4027, 373 => 3912, 374 => 4006, 375 => 4007, 376 => 3883, 377 => 3980, 378 => 3975, 379 => 3857, 380 => 3944, 381 => 3939, 382 => 3816, 383 => 3906, 384 => 3896, 385 => 3774, 386 => 3854, 387 => 3847, 388 => 3722, 389 => 3805, 390 => 3788, 391 => 3665, 392 => 3740, 393 => 3729, 394 => 3602, 395 => 3675, 396 => 3655, 397 => 3533, 398 => 3601, 399 => 3584, 400 => 3456, 401 => 3521, 402 => 3499, 403 => 3378, 404 => 3437, 405 => 3414, 406 => 3286, 407 => 3347, 408 => 3320, 409 => 3199, 410 => 3249, 411 => 3223, 412 => 3099, 413 => 3151, 414 => 3121, ... 534 => 10, 535 => 7, 536 => 7, 537 => 6, 538 => 4, 539 => 4, 540 => 4, 541 => 2, 542 => 2, 543 => 2, 544 => 1, 545 => 1, 546 => 1, }
For my purposes, that collision ratio is far too high, so I set about looking for a way to reduce it. Obviously, I could calculate the positional power of the maximum number of characters and generate a guaranteed unique number, but the number of unique items in my collection is ~2e5, the subsets can be hundreds of elements each, so the signature values would rapidly become so huge as to be unmanageable. Indeed, when stringified longer than the subsets they are meant precise.
My first though was to replace the monotonic sequential weights with the first N prime numbers:
and that certainly reduces the collisions markedly:
C:\test>junk52 -K=6 -M=1 -S=0 2 3 5 7 11 13 17 19 23 29 31 37 41 43 47 53 59 61 67 71 73 79 83 89 97 + 101 736000 Ave. collisions: 586.678088; Max. collisions:2567.000000
Then I though that instead of using sequential primes, it might help if I chose every other prime, or more wildly spaced, and that worked very well. Each doubling of the spacing between prime weights pretty much halving both the average and maximum number of collisions:
C:\test>junk52 -K=6 -M=1 -S=0 2 3 5 7 11 13 17 19 23 29 31 37 41 43 47 53 59 61 67 71 73 79 83 89 97 + 101 736000 Ave. collisions: 586.678088; Max. collisions:2567.000000 C:\test>junk52 -K=6 -M=2 -S=0 3 7 13 19 29 37 43 53 61 71 79 89 101 107 113 131 139 151 163 173 181 +193 199 223 229 239 736000 Ave. collisions: 439.046512; Max. collisions:1400.000000 C:\test>junk52 -K=6 -M=4 -S=0 7 19 37 53 71 89 107 131 151 173 193 223 239 263 281 311 337 359 383 4 +09 433 457 479 503 541 569 736000 Ave. collisions: 198.298142; Max. collisions:699.000000 C:\test>junk52 -K=6 -M=8 -S=0 19 53 89 131 173 223 263 311 359 409 457 503 569 613 659 719 769 827 8 +81 941 997 1049 1097 1163 1223 1283 736000 Ave. collisions: 93.377425; Max. collisions:468.000000 C:\test>junk52 -K=6 -M=16 -S=0 53 131 223 311 409 503 613 719 827 941 1049 1163 1283 1423 1511 1619 1 +747 1877 2003 2129 2267 2377 2503 2657 2741 2861 736000 Ave. collisions: 44.996700; Max. collisions:227.000000
That's great, but there is a limit to how far I can take it, as the weights get bigger, the sum of products starts to exceed my 64-bit integers for ever smaller subsets. So I then started to think about what else I could do to reduce the collisions, and wonders if shuffling the spaced primes would have any benefit, and that alone again halved the collision rates:
C:\test>junk52 -K=6 -M=16 -S=1 2741 2503 4481 7457 3671 7121 3413 6563 827 3301 7561 4909 3011 2657 3 +11 5167 223 6379 1877 2377 1619 7691 6271 1283 6007 3541 736000 Ave. collisions: 19.635207; Max. collisions:119.000000
Of course, that is only one sample of the possible shuffles, and so I ran bunch of reproducible shuffles to see whether all shuffles would be equal; and if not, could I see any pattern to the ordering of those that produced the best results:
C:\test>for /l %s in (1,17,1000) do @junk52 -K=6 -M=16 -S=%s 2741 2503 4481 7457 3671 7121 3413 6563 827 3301 7561 4909 3011 2657 3 +11 5167 223 6379 1877 2377 1619 7691 6271 1283 6007 3541 736000 Ave. collisions: 19.635207; Max. collisions:119.000000 3671 3797 1511 4057 311 2003 1283 223 1423 7457 3301 5573 3541 7691 13 +1 1747 5167 6379 7561 1163 1049 6133 3923 4211 6271 4481 736000 Ave. collisions: 20.188676; Max. collisions:134.000000 941 719 7561 503 4909 6563 4621 7253 3413 1877 5443 1423 3541 409 1511 + 4337 1049 3301 2503 311 6007 6271 6833 3011 7457 5011 736000 Ave. collisions: 19.603839; Max. collisions:123.000000 5011 2741 3167 6007 6563 3413 6971 1747 7253 4057 2267 6271 409 3541 4 +337 7853 613 4751 5167 2657 5443 1283 6379 7457 223 5573 736000 Ave. collisions: 19.998397; Max. collisions:141.000000 941 503 6133 3671 7457 2129 613 2741 7561 1423 223 5849 2503 2267 3413 + 827 2861 3923 1511 719 2003 311 6379 7853 6701 131 736000 Ave. collisions: 18.564826; Max. collisions:121.000000 6701 7691 2267 6007 2003 503 4057 4337 1283 2503 1163 5849 6971 6271 6 +563 7853 1511 4909 3011 7253 7457 7561 3797 941 4621 311 736000 Ave. collisions: 19.263278; Max. collisions:146.000000 4211 4621 7121 1619 3301 719 7691 7561 827 1049 2861 2657 5011 6833 50 +3 6133 2267 311 2129 1283 2503 3167 4751 5849 941 409 736000 Ave. collisions: 18.665543; Max. collisions:106.000000 941 7457 2741 719 4057 1619 1747 4751 827 6271 3011 1049 4621 6379 265 +7 5573 6133 1877 3671 2267 7561 6971 6007 311 4909 613 736000 Ave. collisions: 18.402424; Max. collisions:109.000000 613 311 7561 7253 5309 7121 3011 4621 2861 4337 503 2741 3167 6833 367 +1 2377 131 1423 1163 1747 7853 6007 4751 3541 1511 6133 736000 Ave. collisions: 19.277904; Max. collisions:123.000000 2503 6271 3797 6701 2741 2129 2003 1619 3301 7853 613 2267 6007 3671 4 +621 6563 1423 6379 3167 3923 1747 4211 7253 2377 4057 4751 736000 Ave. collisions: 21.288411; Max. collisions:136.000000 4909 1423 2861 4211 7121 1511 6271 2741 1747 3923 719 5011 5309 4337 5 +849 5711 2003 7253 3541 6833 4481 3797 311 613 6701 1163 736000 Ave. collisions: 20.972484; Max. collisions:141.000000 4751 7253 5167 1511 5011 1747 5443 3167 4057 2741 3541 2129 53 1163 68 +33 6133 6563 5309 2503 1283 7457 3413 4337 1619 3923 4481 736000 Ave. collisions: 21.680830; Max. collisions:175.000000 2129 1747 3797 2267 7853 1049 4211 4481 2003 3301 3671 7253 4621 1511 +6833 5849 4751 2377 3541 7691 5011 6007 1163 53 7121 1619 736000 Ave. collisions: 21.578529; Max. collisions:153.000000 4337 53 4057 5573 6563 3413 4751 223 2267 3011 1049 6271 6007 2503 530 +9 3797 1423 1163 2861 3167 6971 3541 409 6379 6133 1747 736000 Ave. collisions: 19.616374; Max. collisions:133.000000 2377 3671 7691 311 2267 1747 1283 3797 1877 4211 1423 1619 5167 7561 2 +741 1163 6833 719 6007 6271 1511 3167 7253 5573 3301 2003 736000 Ave. collisions: 23.313311; Max. collisions:186.000000 1619 2503 5011 5309 3011 6007 503 2377 719 4909 2129 2003 1163 3413 71 +21 6971 3541 4751 3301 6379 5167 941 5711 311 6833 2267 736000 Ave. collisions: 20.357250; Max. collisions:137.000000 3923 4057 7561 53 2267 6971 6563 4751 5711 7853 3413 2129 719 5573 421 +1 2657 3011 1877 7121 5443 2861 2377 503 1283 4909 2503 736000 Ave. collisions: 20.486394; Max. collisions:135.000000 1049 827 2129 7121 1619 1877 53 3671 6701 719 1511 3413 311 4057 4211 +6833 2503 2003 7253 5011 5849 5167 3011 6971 1163 2657 736000 Ave. collisions: 19.818072; Max. collisions:128.000000 503 2377 6971 4909 1163 2267 5167 3413 719 4621 409 6833 3301 1283 571 +1 6133 7121 941 311 4211 3167 223 4751 5849 4057 2861 736000 Ave. collisions: 20.875560; Max. collisions:119.000000 2741 3301 6563 6007 1283 3797 5167 2129 2503 4211 5011 4751 3011 941 6 +133 7561 3923 4621 1163 7121 1423 3167 613 1619 1747 4909 736000 Ave. collisions: 22.637387; Max. collisions:154.000000 2129 6701 3011 1877 3671 7121 5711 4909 1049 3797 1511 4481 1619 6133 +6833 1423 5573 5443 131 2861 827 7691 7253 5849 4057 7853 736000 Ave. collisions: 20.488104; Max. collisions:120.000000 3797 4057 4481 827 4211 503 5711 7561 6271 6833 719 4751 1423 131 2267 + 4337 1163 5011 1049 2503 2861 2003 7457 941 2377 3011 736000 Ave. collisions: 20.286576; Max. collisions:150.000000 409 6563 4909 1283 6701 53 5309 7121 4057 4751 5573 311 1049 6971 4621 + 7691 3923 3167 2741 4481 2267 3797 719 6271 2657 613 736000 Ave. collisions: 19.802614; Max. collisions:133.000000 3413 1511 6271 1049 6833 2741 4751 4057 4337 7853 6007 1423 1619 3541 +2503 4909 2657 3797 2861 2003 1747 3011 53 311 131 5011 736000 Ave. collisions: 21.646410; Max. collisions:175.000000 409 3923 4909 5309 131 4337 7457 7121 613 719 2129 6563 3797 6133 4621 + 5849 1619 4057 7561 6271 1163 2657 3301 1283 5443 941 736000 Ave. collisions: 19.018960; Max. collisions:122.000000 4481 6007 2503 4337 1511 5309 1049 5711 2003 2129 6833 7121 5573 3167 +1747 2267 2861 1877 3541 311 3923 3671 827 6271 5849 3301 736000 Ave. collisions: 23.165146; Max. collisions:182.000000 1163 53 1877 4751 1283 311 409 4211 6133 6379 6833 3167 1511 6701 5573 + 6971 3541 7121 3797 2377 7853 1619 131 613 827 7691 736000 Ave. collisions: 17.260901; Max. collisions:111.000000 2657 827 4211 7253 3011 5309 4337 6271 1423 2003 1747 3541 409 6007 36 +71 5443 7853 941 5711 5849 6563 3797 3301 7561 7121 1619 736000 Ave. collisions: 19.355442; Max. collisions:111.000000 3923 2503 6379 719 5573 827 613 1049 3301 1163 7853 1283 6133 6007 409 + 6563 6833 4481 7121 2129 1511 2003 3011 6701 4909 1747 736000 Ave. collisions: 19.884976; Max. collisions:140.000000 2741 719 5711 7561 2657 6379 6563 1877 503 613 2129 4211 7691 1283 448 +1 4621 5011 3413 7853 1163 7121 4337 223 6833 4057 5573 736000 Ave. collisions: 18.342825; Max. collisions:119.000000 941 4481 6563 5167 7691 1877 6007 1163 2861 223 3541 4621 5711 4751 16 +19 3671 4211 1511 1283 5849 503 2377 2657 2741 6701 2267 736000 Ave. collisions: 22.146454; Max. collisions:146.000000 4751 613 409 53 1283 4057 719 7561 6563 7691 1511 5309 6007 7253 6133 +941 6379 3923 223 1747 131 6971 1163 7853 6701 5167 736000 Ave. collisions: 16.439247; Max. collisions:108.000000 2129 1423 5711 7691 409 3413 719 3301 503 2741 2267 2503 3011 4751 316 +7 3797 7561 613 4337 827 5167 223 7253 6133 2861 5309 736000 Ave. collisions: 19.102351; Max. collisions:111.000000 6971 6271 409 3797 131 6007 5573 2003 503 6379 6833 613 827 5849 1619 +6701 941 7561 4057 2741 1423 1511 3413 5309 4481 5443 736000 Ave. collisions: 18.401505; Max. collisions:108.000000 5711 2741 1423 2861 7253 5309 6133 2657 6007 1877 2267 1049 409 4211 7 +19 223 4909 5849 7691 3167 827 3923 1283 3011 3671 131 736000 Ave. collisions: 21.309360; Max. collisions:129.000000 4337 2003 1423 2657 827 719 3797 2129 5711 7121 941 6133 53 2267 2741 +3301 7253 3413 3671 6833 7853 5309 409 3011 6563 5573 736000 Ave. collisions: 21.534351; Max. collisions:147.000000 7853 6563 4211 2861 131 3413 6833 5309 3797 3671 1049 3167 4337 1747 1 +619 223 3541 7121 719 1877 5167 2741 5849 4621 3923 409 736000 Ave. collisions: 20.528090; Max. collisions:135.000000 3301 7853 3797 4057 2003 719 6271 2503 6379 5443 6133 827 6971 7457 53 + 3167 2129 3011 1747 2657 1619 2377 4751 941 1049 7121 736000 Ave. collisions: 19.801549; Max. collisions:115.000000 1619 3923 7253 2741 941 719 3671 5573 5443 2861 2657 1877 3413 4211 51 +67 311 7853 53 6971 5011 2377 6563 613 2129 1283 6133 736000 Ave. collisions: 21.071519; Max. collisions:141.000000 5443 7253 1283 3541 503 131 2267 5309 2861 2657 2129 311 4481 5011 421 +1 1163 53 6563 7121 1423 5167 4057 3923 5711 5849 3413 736000 Ave. collisions: 21.947091; Max. collisions:145.000000 6007 53 5011 7121 2129 6701 2741 2267 1619 7253 4211 3413 719 1283 354 +1 2003 2657 1049 7853 6379 6271 5711 3671 1423 1877 1163 736000 Ave. collisions: 20.695421; Max. collisions:127.000000 4751 6563 223 941 2861 1619 1511 6833 3011 2377 7853 7561 5309 5849 34 +13 5573 5711 613 4211 1877 2003 3167 6271 4337 2267 1423 736000 Ave. collisions: 20.448842; Max. collisions:119.000000 4621 2657 4337 311 4057 7457 3167 4211 1423 5309 5167 2377 3413 2861 6 +379 1747 7853 2003 6007 6971 1511 4909 2741 5443 3301 1619 736000 Ave. collisions: 22.696702; Max. collisions:146.000000 2741 4211 2503 1511 1049 3541 4621 2377 4057 719 5443 5711 7691 3011 6 +007 1423 3923 2657 1163 3301 6701 827 1619 2861 53 6271 736000 Ave. collisions: 22.131143; Max. collisions:127.000000 3011 2129 6133 7691 6971 1747 2861 3413 4337 3923 5849 941 1619 1163 4 +211 3167 1877 3301 53 1511 503 2267 827 131 7253 2003 736000 Ave. collisions: 22.948541; Max. collisions:175.000000 3923 223 2377 5849 2741 5167 5309 2861 7853 5443 6701 1283 6133 3671 7 +561 1423 3797 6271 941 613 6971 6563 131 1049 503 4621 736000 Ave. collisions: 18.006383; Max. collisions:114.000000 827 4621 1283 4057 3541 7253 7561 613 1423 1511 941 2267 4909 2503 116 +3 3413 3301 5573 2657 6007 4751 6971 1747 4481 719 2377 736000 Ave. collisions: 20.982645; Max. collisions:120.000000 1163 7253 1619 3541 6007 223 613 2861 4211 1877 2129 2003 719 6563 745 +7 2377 1747 5573 6133 3167 7121 5011 2741 6379 5167 6701 736000 Ave. collisions: 19.644637; Max. collisions:114.000000 3923 2377 3671 6133 6563 1877 2267 2129 4481 4211 3541 2003 7561 941 7 +19 613 1163 1511 4621 2503 5309 409 7457 7691 1283 6379 736000 Ave. collisions: 19.671930; Max. collisions:131.000000 7561 3301 1747 1423 223 4481 941 53 1163 4751 6833 2267 5573 7253 7853 + 6563 4211 1049 7691 1619 311 2657 3413 2503 7457 131 736000 Ave. collisions: 19.086505; Max. collisions:147.000000 7691 3671 1283 131 1163 5011 5849 7853 3301 2003 3797 4621 4337 5167 3 +11 7121 2861 3011 1619 2377 2267 2657 1049 6701 6379 223 736000 Ave. collisions: 19.015522; Max. collisions:164.000000 6701 7561 719 613 3923 7853 4057 827 941 5711 2503 1049 3671 1283 7691 + 4621 7121 4909 5167 5573 3011 2267 311 1747 2129 3167 736000 Ave. collisions: 19.334603; Max. collisions:119.000000 1619 4621 3301 3541 6971 2861 53 1283 3671 7691 409 6833 4481 3413 670 +1 5011 2741 1747 5711 2267 2377 1163 5443 5167 2503 3923 736000 Ave. collisions: 21.216638; Max. collisions:128.000000 5573 3671 7121 223 5011 7457 941 719 2003 1511 1423 3541 6007 3797 187 +7 5167 4751 3923 2129 6271 7691 131 7253 1049 5849 827 736000 Ave. collisions: 18.597651; Max. collisions:123.000000 1283 7253 2503 3797 7691 6379 1747 7121 311 6563 4909 3301 613 5849 10 +49 4211 4621 827 1163 1619 5443 1511 6133 5309 6833 5167 736000 Ave. collisions: 18.933860; Max. collisions:124.000000 6007 5167 4481 4909 6271 5309 3797 311 1283 1423 7121 7691 223 1049 42 +11 53 5573 2377 131 827 2003 2657 6379 5011 5443 1163 736000 Ave. collisions: 20.254766; Max. collisions:124.000000 4211 7121 6833 3413 3923 2267 5011 1619 5849 131 5443 3167 2503 7691 4 +337 941 2003 2129 4481 53 6971 1877 1283 2861 6563 4751 736000 Ave. collisions: 22.598478; Max. collisions:154.000000 409 5711 1747 5167 53 1423 1619 2267 223 3301 6563 3541 6271 4621 3923 + 131 5573 7691 6833 4481 2741 719 6007 2503 4909 3797 736000 Ave. collisions: 19.441816; Max. collisions:124.000000 1877 3671 7853 3797 2003 1283 2657 7457 4621 5849 3167 2741 3413 1423 +6833 3923 5167 6133 827 2503 409 3011 2377 3301 5711 1747 736000 Ave. collisions: 21.433425; Max. collisions:132.000000
I've done a bunch of runs like the one above and the results are remarkably consistent, with the average number of collisions rarely going below 18/sig or above 22/sig. On the face of things, seemingly, any shuffle is good enough to reduce the collision rate by a half relative to monotonically increasing spaced primes.
But therein lies my quandary, monotonically increasing values is just another of the possible shuffles; so why would every other shuffle produce so much better results? Then I looked more closely at the results I was getting (above and other runs) and in most runs there are one or two that exceed 23/sig average, and usually at most one that get below 18. In the case of the run above this results set stands out:
4751 613 409 53 1283 4057 719 7561 6563 7691 1511 5309 6007 7253 6133 +941 6379 3923 223 1747 131 6971 1163 7853 6701 5167 736000 Ave. collisions: 16.439247; Max. collisions:108.000000
With the average collisions >1.5/sig lower than the next best; but there does not seem to be anything special about the ordering of the primes. (Note:The primes used are the same in all cases, only their ordering varies.)
The obvious programming approach to this goal would be to implement a GA for the problem. The scoring function is trivial, if not especially quick, and the selection criteria could be the average or maximum collision rates, or some weighted combination of the two. I favour the former.
But where I'm having trouble is trying to decide a good mutation strategy? Looking at the results sets above (and the others I've generated), there seems to be a great preponderance of average results, and very few 'good' ones; so I'm think that any kind of mass reordering of the good sets, as a mutation strategy, is simply likely to produce more average sets?
So, I'm thinking rather than mutating the top 10 or twenty results in some large way, I should produce the next generation by swapping several, single pairs of values in each of the (say) top three best results from the previous generation?
And finally (sorry this got so long), do any of the math-heads out there have any suggestions for a better -- perhaps mathematical -- method of choosing and/or ordering of the low prime weights? Or indeed, and other set of numerical weights that won't exceed my 64-bit integers when sum-producted?
Thanks for your time and considerations; any thoughts and suggestions are welcomed.
|
|---|