in reply to Re: Finding The Best Cluster Problem
in thread Finding The Best Cluster Problem

Dear BrowserUK,
A 'cluster' is a rectangular subset of @sim?
Sorry, I'm not sure if I understand what do you mean by rectangular subset.
A cluster C is simply a subset of X and it can have uneven size.
The density of the cluster is the average of the values in the cluster?
Yes, I we use "average" as measure of density.
Which would make the problem: Find __some__ subset with the largest average value; and if two subsets have the same average, favour the largest subset?
Almost. Say C1 has average 0.8 but only 1 member (excluding centroid) , and C2 has average 0.7 with 5 members, we would weight C2 as better than C1.

---
neversaint and everlastingly indebted.......

Replies are listed 'Best First'.
Re^3: Finding The Best Cluster Problem
by BrowserUk (Patriarch) on May 16, 2007 at 11:00 UTC
    Say C1 has average 0.8 but only 1 member (excluding centroid) , and C2 has average 0.7 with 5 members, we would weight C2 as better than C1.

    The problem is, you aren't specifying how you are scoring that.

    1. If you go for the highest average (as I was) then 0.8*1/1 > 0.7 *5/5.
    2. If you go for the highest score, which makes 3.5 beat 0.8, then you will never remove anything from the set because it would reduce the (total) score.

    Put another way:

    • Allowing irregular subsets, the 'cluster' of 9 1.0s that form the major diagonal gives you an average of 1.0 and a size of 9. This can never be beaten, (scoring the average) as adding any lesser value than 1.0 value will decrease the average.
    • But if you use the total, you'll never remove anything from the set as any removal will reduce that total.

    Unless you introduce some other metric or heuristic, you don't have a scoring mechanism that reflects your stated goals?


    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
      the 'cluster' of 9 1.0s that form the major diagonal gives you an average of 1.0

      As mentioned earlier, the self-edges don't count.

      A word spoken in Mind will reach its own level, in the objective world, by its own weight
Re^3: Finding The Best Cluster Problem
by BrowserUk (Patriarch) on May 16, 2007 at 10:29 UTC

    You beat me to it, but my reading of the requirement (inevitably) produces 9 equal subsets with an size of 1 and an average of 1. D'oh!

    However, I'm now confused by how you can have a centroid of an irregularly shaped set, given discrete values?

    Actually, I think I'm completely lost by your meaning now. Could you give an example of two subsets, their relative densities, and how you calculated them?


    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
      Dear BrowserUK,
      Could you give an example of two subsets, their relative densities, and how you calculated them?
      Let's say we have two cluster C1 and C2
      C1 = (s2,s3,m3,w3), with centroid(z) = s2 sim(s2,s3)=0.5 sim(s2,m3)=0.3 sim(s2,w3)=0.36 ----------------+ Total = 0.96 |C1| = 4 Density = Total/|C1| = 0.96/4 = 0.24 we skip including this: sim(s2,s2)=1 We never compare a centroid with itself.
      And C2
      C2 = (s3,m1,w3), with centroid(z) = w3 sim(w3,s3)=0.4 sim(w3,m1)=0.5 ----------------+ Total = 0.9 |C2| = 3 Density = Total/|C2| = 0.9/3= 0.3


      ---
      neversaint and everlastingly indebted.......

        Okay, but the problem is that you are asking for a single maximised cluster. But by my logic, (and the references I've looked at for 'clustering algorithms'), suggest that that doesn't make much sense, without some heuristic that allows you to exclude a value and improve the 'score' of what remains. Otherwise the best single cluster is the one that contains everything.

        For example, this footnote from step 4 of the algorithm description here:

        (*) Of course there is no point in having all the N items grouped in a single cluster but, once you have got the complete hierarchical tree, if you want k clusters you just have to cut the k-1 longest links.

        Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
        "Science is about questioning the status quo. Questioning authority".
        In the absence of evidence, opinion is indistinguishable from prejudice.