How do you define clusters? It appears that you want all strings at least 80% similar to the current string. Does that mean that if you have A,B,C,D with A and B being clustered you would end up with A=>(B),B=>(A),C=>(),D=>()?? I think your above code would produce A=>(B),B=>(),C=>(),D=>() which doesn't seem right at all because then you can't tell when given B that A is close to it.

It would seem to me that you could end up with A 80% of B, and B 80% of C but A could be only 60% of C...so then do you get A=>(B), B=>(A,C) but I don't think your code above handles that.

I think you need to iterate over every element ever time. So that as you go you would create clusters of all terms centered around the current term. Then move onto the next term and scan them all agian. This of coures makes the solution slower not faster though! ;)

My other thought would be to create a new cluster any time the current term doesn't fall inside a current cluster or falls near its edge. So say you start with A. You then compare B to the only cluster you currently have and find it is 90% similar to A so you add it to A. Then you check C, it is 80% similar to A so you add it to A but its below a threshold (say 85% for this excersise) so you create a new cluster C. Once you have finished scanning all elements for A then you move on to the next cluster and see what elements fall inside of it. I think when you are done you would have general clusters that items could be added to. So after the initial painfull additions, then when you add items instead of scanning 150,000 items for similarity you are scanning 50,000 clusters and creating new clusters as you go. Of coures then balancing your thresholds for joining new clusters or creating your own are very important. Enough rambling from me, but if you had some sample data it would be a fun diversion to mess with! ;) One last thought was to cluster around some dynamicaly generated centers, but I have no clue how to go about that.

Update:It just occured to me that you could also recenter clusters occasionaly picking one string in that cluster that better represented its center. Probably some statistical device you could use to decide when this was needed.


___________
Eric Hodges

In reply to Re: Fast string similarity method by eric256
in thread Fast string similarity method by icanwin

Title:
Use:  <p> text here (a paragraph) </p>
and:  <code> code here </code>
to format your post, it's "PerlMonks-approved HTML":



  • Posts are HTML formatted. Put <p> </p> tags around your paragraphs. Put <code> </code> tags around your code and data!
  • Titles consisting of a single word are discouraged, and in most cases are disallowed outright.
  • Read Where should I post X? if you're not absolutely sure you're posting in the right place.
  • Please read these before you post! —
  • Posts may use any of the Perl Monks Approved HTML tags:
    a, abbr, b, big, blockquote, br, caption, center, col, colgroup, dd, del, details, div, dl, dt, em, font, h1, h2, h3, h4, h5, h6, hr, i, ins, li, ol, p, pre, readmore, small, span, spoiler, strike, strong, sub, summary, sup, table, tbody, td, tfoot, th, thead, tr, tt, u, ul, wbr
  • You may need to use entities for some characters, as follows. (Exception: Within code tags, you can put the characters literally.)
            For:     Use:
    & &amp;
    < &lt;
    > &gt;
    [ &#91;
    ] &#93;
  • Link using PerlMonks shortcuts! What shortcuts can I use for linking?
  • See Writeup Formatting Tips and other pages linked from there for more info.