The reason primes give you a better result is that their progression is close enough to an exponential one.
Hm. With powers of 10, you allow for an alphabet of 20; with powers of 2, 64.
Even if you move to powers of a fractional base, you can still only cater for relatively small alphabets: e.g. 1.1^N exceeds 2^64 after N=466:
$l = 2**64; $x = 1.1; $n = 1; ++$n while $x**$n < $l; print $n;; 466
My reason for choosing primes is that there are approximately 416 quadrillion primes less than 2^64; enough to allow for even the largest alphabet and still be able to spread them out by using every 10th, 20th or even 100th prime.
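For instance (just a sketch, assuming the Math::Prime::Util module; the sub name and the 26/100 figures are mine, purely for illustration):

use Math::Prime::Util qw( nth_prime );
## take every $step-th prime as a weight for an alphabet of $size characters
sub spacedPrimes {
    my( $size, $step ) = @_;
    return map { nth_prime( $_ * $step ) } 1 .. $size;
}
my @weights = spacedPrimes( 26, 100 );    ## e.g. every 100th prime for a 26-char alphabet
print "@weights\n";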
I actually believed this couldn't be true until I realized you are not working with fixed length input.
I was confused as to what you meant by "not fixed length input"; but eventually the penny dropped.
Yes: having more weights than there are members in each subset, and shuffling them all before selecting the first K to match the subset size, is what makes the shuffling approach work.
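In code that step amounts to something like this (sketch only; List::Util's shuffle stands in for whatever shuffle is actually used, and the pool and K here are placeholders):

use List::Util qw( shuffle );
my @pool = ( 2, 3, 5, 7, 11, 13, 17, 19, 23, 29 );    ## more weights than subset members
my $K    = 4;                                          ## size of the subset being signed
my @picked = ( shuffle @pool )[ 0 .. $K - 1 ];         ## shuffle everything, keep the first K
print "@picked\n";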
You can start with 1, and then add your coefficients by maximizing the ratio (the distance between the bits two characters will affect is log(K1) - log(K2), an increasing function of K1/K2).
That's an interesting theory. (Or fact? It's not clear from your post; and my math isn't strong enough to decide one way or the other.)
If that process of 'interleaving' the weights approaches optimal, then when I run my GA on a large selection of primes, I would expect it to quite quickly start converging on numbers that come close to that kind of relationship.
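A quick-and-dirty check for that relationship would be to look at the successive ratios of the ordered weights and see whether they settle around a constant (sketch only; the list here is just placeholder primes standing in for whatever the GA produces):

my @weights = sort { $a <=> $b } ( 3, 17, 83, 419, 2083, 10427 );    ## placeholder primes
my @ratios  = map { $weights[ $_ ] / $weights[ $_ - 1 ] } 1 .. $#weights;
printf "ratios: %s\n", join ' ', map { sprintf '%.3f', $_ } @ratios;
## near-constant ratios would suggest the ordering approximates a geometric (exponential) progression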
I did start a run of the GA this morning; but I used too large an alphabet (1000) and my mutation function was not good, so whilst it was slowly improving with each iteration, it was going to run for a very long time before getting anything noteworthy.
I've had some more thoughts about the mutation strategy; tomorrow I'll code that up and start a run using a much smaller alphabet (100 or 200) and see how things look after a couple of hours. It'll be really interesting if the results bear out your thesis. (Assuming my new mutation sub works.)
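To illustrate the kind of thing I mean by a mutation sub, a minimal pairwise-swap sketch (not necessarily the strategy I'll end up using):

## swap two randomly chosen positions in an ordering of primes
sub mutate {
    my @order = @{ shift() };
    my( $i, $j ) = ( int rand @order, int rand @order );
    @order[ $i, $j ] = @order[ $j, $i ];
    return \@order;
}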
Once I get some results worth sharing, I'll come back.
BTW: Thank you for your time and input. This problem is one I've been playing with on and off, going back years. At this point it's not actually for anything in particular; rather, it's my attempt to find a generic, or at least more general, solution to a problem that I've had to solve many times down the years, where each time the solution had to be developed to be very specific to the exact dataset it had to handle.
That's a pain in the rump because: 1) it can take a lot of effort and testing to develop an algorithm that works for a particular dataset, and then it cannot be reused; 2) if the dataset for a particular project changes over time, it can necessitate reiterating the development process to re-tailor it to the new state of play.
My hope is that if I can find a reasonable (automated and not too slow) way of choosing a set and ordering of primes for a particular size of alphabet, it could save a great deal of effort in the future.
My current notion goes something like this: feed in the size of the alphabet; divide that into the number of primes less than 2^64 and use half of the result as a step size to select a bunch of widely spaced primes. Then run a GA on those primes over (say) 100 million random combinations and use the best ordering found by the GA for the actual processing.
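A rough sketch of that selection step (again assuming Math::Prime::Util; exact nth_prime is impractical at these magnitudes, so its approximation functions stand in, and 2^64 is taken as ~0):

use Math::Prime::Util qw( prime_count_approx nth_prime_approx next_prime );
my $alphabet = 100;                                       ## size of the alphabet fed in
my $nPrimes  = prime_count_approx( ~0 );                  ## ~416 quadrillion primes below 2^64
my $step     = int( $nPrimes / $alphabet / 2 );           ## half of (primes / alphabet) as the step
my @selected = map { next_prime( nth_prime_approx( $_ * $step ) ) } 1 .. $alphabet;
## @selected now holds $alphabet widely spaced primes to feed to the GA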
It's somewhere to start :)