in reply to Fast string similarity method
How do you define clusters? It appears that you want all strings at least 80% similar to the current string. Does that mean that if you have A,B,C,D with A and B being clustered you would end up with A=>(B),B=>(A),C=>(),D=>()?? I think your above code would produce A=>(B),B=>(),C=>(),D=>() which doesn't seem right at all because then you can't tell when given B that A is close to it.
It would seem to me that you could end up with A 80% of B, and B 80% of C but A could be only 60% of C...so then do you get A=>(B), B=>(A,C) but I don't think your code above handles that.
I think you need to iterate over every element ever time. So that as you go you would create clusters of all terms centered around the current term. Then move onto the next term and scan them all agian. This of coures makes the solution slower not faster though! ;)
My other thought would be to create a new cluster any time the current term doesn't fall inside a current cluster or falls near its edge. So say you start with A. You then compare B to the only cluster you currently have and find it is 90% similar to A so you add it to A. Then you check C, it is 80% similar to A so you add it to A but its below a threshold (say 85% for this excersise) so you create a new cluster C. Once you have finished scanning all elements for A then you move on to the next cluster and see what elements fall inside of it. I think when you are done you would have general clusters that items could be added to. So after the initial painfull additions, then when you add items instead of scanning 150,000 items for similarity you are scanning 50,000 clusters and creating new clusters as you go. Of coures then balancing your thresholds for joining new clusters or creating your own are very important. Enough rambling from me, but if you had some sample data it would be a fun diversion to mess with! ;) One last thought was to cluster around some dynamicaly generated centers, but I have no clue how to go about that.
Update:It just occured to me that you could also recenter clusters occasionaly picking one string in that cluster that better represented its center. Probably some statistical device you could use to decide when this was needed.
|
---|