If you define a cluster by a base string to which every other string in the cluster is at least 80% similar, then instead of comparing each string to all the others, you only need to compare it against each cluster's base string (the one that defines the cluster). There are then two cases to look at.

The first case is the easy one. If the string passes the 80% mark for a cluster, it goes in that cluster; if it passes for more than one, I would suggest putting it in the one it is most similar to.

The second case is the harder one. If the string does not pass the 80% mark for any cluster, then you want to create a cluster that it will fit into. The simplest way to do that is to just use it as the base for the new cluster. The problem is that you may not end up with the most dissimilar clusters, which is probably what you want. (Note that when the most dissimilar cluster bases are found, you should end up with the smallest number of clusters.)
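The two cases above can be sketched roughly as follows. This is only a minimal illustration in Python, not the one true implementation: the `similarity` function (built here on `difflib.SequenceMatcher`), the `cluster` function name, and the `threshold` parameter are all my own assumptions, and you would swap in whatever similarity measure you are actually using.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    # Assumed similarity measure in [0, 1]; replace with your own metric.
    return SequenceMatcher(None, a, b).ratio()


def cluster(strings, threshold=0.8):
    """Assign each string to the most similar cluster base, or start a new cluster."""
    clusters = {}  # base string -> list of members (the base is its own first member)
    for s in strings:
        best_base, best_score = None, threshold
        # Compare only against cluster bases, never against every other string.
        for base in clusters:
            score = similarity(s, base)
            if score >= best_score:
                best_base, best_score = base, score
        if best_base is None:
            # Case 2: no base is similar enough, so s becomes a new base.
            clusters[s] = [s]
        else:
            # Case 1: put s in the cluster it is most similar to.
            clusters[best_base].append(s)
    return clusters


if __name__ == "__main__":
    for base, members in cluster(["apple", "apples", "banana", "bananas"]).items():
        print(base, members)
```

Note that the result depends on the order of the input, since whichever string fails the test first becomes a base. That is exactly the weakness described above.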

To solve that problem, you would want to find a set of the most dissimilar strings in the input and use those as the cluster bases. The method for finding the most distant strings would depend on the method used for measuring the similarity between strings.
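One generic way to pick dissimilar bases, which works with any similarity function, is a greedy "farthest-first" pass: repeatedly promote the string that is least similar to every base chosen so far, until every string is within the threshold of some base. The sketch below is a heuristic I am suggesting, not something guaranteed to find the optimal set, and the names `pick_bases` and `similarity` are again assumptions. It also costs one comparison per string per base, so it is not free.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    # Assumed similarity measure in [0, 1]; replace with your own metric.
    return SequenceMatcher(None, a, b).ratio()


def pick_bases(strings, threshold=0.8):
    """Greedy farthest-first choice of cluster bases (a heuristic, not optimal)."""
    if not strings:
        return []
    bases = [strings[0]]
    # closest[i] is the best similarity between strings[i] and any chosen base.
    closest = [similarity(s, bases[0]) for s in strings]
    # At most len(strings) - 1 further bases can ever be needed.
    for _ in range(len(strings) - 1):
        # Find the string that is least similar to all current bases.
        i = min(range(len(strings)), key=closest.__getitem__)
        if closest[i] >= threshold:
            break  # every string already fits into some cluster
        bases.append(strings[i])
        closest = [max(c, similarity(s, strings[i]))
                   for c, s in zip(closest, strings)]
    return bases


if __name__ == "__main__":
    print(pick_bases(["apple", "apples", "banana", "bananas", "cherry"]))
```

Once the bases are chosen, each remaining string can be assigned to its most similar base as in the first case.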

Of course, just using the simple method above (comparing only against cluster bases) will already give some performance gain, depending on how many clusters there are in the set.


In reply to Re: Fast string similarity method by neosamuri
in thread Fast string similarity method by icanwin
