Re^4: Finding The Best Cluster Problem

Dear BrowserUK,

Could you give an example of two subsets, their relative densities, and how you calculated them?

Let's say we have two clusters, C1 and C2.

C1 = (s2,s3,m3,w3), with centroid(z) = s2

sim(s2,s3)=0.5
sim(s2,m3)=0.3
sim(s2,w3)=0.36
----------------+
   Total = 1.16
   |C1| = 4
Density = Total/|C1| = 1.16/4 = 0.29

We leave out sim(s2,s2)=1:
we never compare a centroid with itself.

And C2

C2 = (s3,m1,w3), with centroid(z) = w3

sim(w3,s3)=0.4
sim(w3,m1)=0.5
----------------+
   Total = 0.9
   |C2| = 3
Density = Total/|C2| = 0.9/3= 0.3
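
A minimal Python sketch of the density calculation for C2, as I read it from the figures above. The function name `cluster_density` and the dict-of-pairs layout for the similarities are assumptions made for illustration only; they do not come from any library.

```python
def cluster_density(centroid, members, sim):
    """Sum sim(centroid, x) over every other member x, then divide by |C|.

    The centroid is never compared with itself, but it still counts
    towards the cluster size |C|.
    """
    total = sum(sim[frozenset((centroid, m))] for m in members if m != centroid)
    return total / len(members)


# Similarities for C2, keyed by unordered pair (hypothetical layout).
sim = {
    frozenset(("w3", "s3")): 0.4,
    frozenset(("w3", "m1")): 0.5,
}

c2 = ["s3", "m1", "w3"]
print(round(cluster_density("w3", c2, sim), 4))  # 0.3
```

The same function can be called for C1 with s2 as the centroid and its three similarities.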

---
neversaint and everlastingly indebted.......


Replies are listed 'Best First'.

Re^5: Finding The Best Cluster Problem
by BrowserUk (Patriarch) on May 16, 2007 at 14:20 UTC

Okay, but the problem is that you are asking for a single maximised cluster. My logic (and the references I've looked at for 'clustering algorithms') suggests that that doesn't make much sense without some heuristic that allows you to exclude a value and improve the 'score' of what remains. Otherwise the best single cluster is the one that contains everything.

For example, this footnote from step 4 of the algorithm description here:

(*) Of course there is no point in having all the N items grouped in a single cluster but, once you have got the complete hierarchical tree, if you want k clusters you just have to cut the k-1 longest links.
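
One way to read that footnote, sketched in Python under the assumption of single-linkage clustering: merging the closest pairs first and stopping when k groups remain gives the same result as building the full tree and cutting its k-1 longest links. The item names, positions and the helper `single_linkage_clusters` are hypothetical, and the sketch works on distances rather than the similarities used earlier in the thread.

```python
from itertools import combinations


def single_linkage_clusters(items, dist, k):
    """Merge the closest groups until only k groups remain."""
    parent = {x: x for x in items}

    def find(x):
        # Follow parent links to the group representative.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    groups = len(items)
    # Visit links from shortest to longest.
    for a, b in sorted(combinations(items, 2), key=lambda p: dist[frozenset(p)]):
        if groups == k:
            break  # the remaining (longest) links are the ones that get "cut"
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[ra] = rb
            groups -= 1

    clusters = {}
    for x in items:
        clusters.setdefault(find(x), []).append(x)
    return sorted(sorted(c) for c in clusters.values())


# Hypothetical items placed on a line; distance is the gap between them.
pos = {"a": 0, "b": 1, "c": 2, "x": 10, "y": 11}
dist = {frozenset(p): abs(pos[p[0]] - pos[p[1]]) for p in combinations(pos, 2)}

print(single_linkage_clusters(list(pos), dist, 2))
# [['a', 'b', 'c'], ['x', 'y']]
```

With k = 1 everything ends up in one group, which is the degenerate "best single cluster" described above.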

Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.

"Science is about questioning the status quo. Questioning authority".

In the absence of evidence, opinion is indistinguishable from prejudice.

"Too many [] have been sedated by an oppressive environment of political correctness and risk aversion."
