in reply to Fast string similarity method

If you define a cluster by a string that all others in the cluster are at least 80% similar to, then instead of comparing each string to all the others, you only need to compare it against the cluster string (the one which defines the cluster). There are then two cases to look at.

If the string passes the 80% mark for more than one cluster, then I would suggest putting it in the one it is most similar to.

The second case is the harder one. If the string does not pass the 80% mark for any cluster, then you want to create a cluster that it will fit into. The simplest way to do that is to just use it as the base for the new cluster. The problem is that you may not end up with the most dissimilar clusters, which is probably what you want. (Note that when the most dissimilar cluster bases are found you will have the smallest number of clusters.)
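Here is a rough sketch of the two cases in Python. The similarity function is only a stand-in (I am assuming difflib's SequenceMatcher ratio, which is not from the original question), and the names cluster and similarity are made up for illustration, so swap in whatever measure you actually use.

```python
from difflib import SequenceMatcher

def similarity(a, b):
    # Stand-in measure in [0, 1]; an assumption, replace with your own.
    return SequenceMatcher(None, a, b).ratio()

def cluster(strings, threshold=0.8, sim=similarity):
    """Group strings by comparing each one only against the cluster bases."""
    clusters = {}  # base string -> list of members (the base is a member too)
    for s in strings:
        best_base, best_score = None, -1.0
        for base in clusters:
            score = sim(s, base)
            # Case 1: several bases may pass; keep the most similar one.
            if score >= threshold and score > best_score:
                best_base, best_score = base, score
        if best_base is None:
            # Case 2: no base passes, so s becomes the base of a new cluster.
            clusters[s] = [s]
        else:
            clusters[best_base].append(s)
    return clusters
```

Something like cluster(list_of_strings) then returns a dict keyed by the base strings. Note that the result depends on the order of the input, which is exactly the weakness described above.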

To solve that problem you would want to try to find a set of the most dissimilar strings in the set and use those as the cluster bases. The method for finding the most distant strings in the set would depend on the method for finding the similarity between the strings.
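One generic way to pick dissimilar bases, if you know nothing special about the similarity measure, is a greedy "farthest first" pass: keep choosing the string that is least similar to every base chosen so far, and stop once every string is within 80% of some base. This is only a heuristic sketch and I am not claiming it gives the minimum number of clusters. The similarity function is again an assumed stand-in (difflib), and pick_bases is a made-up name.

```python
from difflib import SequenceMatcher

def similarity(a, b):
    # Stand-in measure in [0, 1]; an assumption, replace with your own.
    return SequenceMatcher(None, a, b).ratio()

def pick_bases(strings, threshold=0.8, sim=similarity):
    """Greedily choose cluster bases that are far apart from each other."""
    if not strings:
        return []
    bases = [strings[0]]
    # closest[i] = highest similarity between strings[i] and any chosen base
    closest = [sim(s, strings[0]) for s in strings]
    closest[0] = float("inf")  # mark as chosen so it is never picked again
    while True:
        # The string that is least covered by the current bases.
        i = min(range(len(strings)), key=closest.__getitem__)
        if closest[i] >= threshold:
            return bases  # every string is within the threshold of a base
        bases.append(strings[i])
        closest = [max(c, sim(s, strings[i])) for c, s in zip(closest, strings)]
        closest[i] = float("inf")
```

This costs one comparison per string for every base chosen, so it is still far fewer comparisons than all pairs as long as the number of clusters is small.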

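Of course, just using the method above will give some performance gain (depending on how many clusters are in the set), since each string is compared only against the cluster bases rather than against every other string.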