neversaint has asked for the wisdom of the Perl Monks concerning the following question:

Dear Masters,
Suppose I have the following two variables:
my @X = qw(s1 s2 s3 m1 m2 m3 w1 w2 w3);
and
my @sim = ( # s1 s2 s3 m1 m2 m3 w1 w2 w3 [1.00, 0.92, 0.56, 0.60, 0.45, 0.33, 0.23, 0.23, 0.33], # s1 [0.92, 1.00, 0.50, 0.55, 0.40, 0.33, 0.23, 0.27, 0.36], # s2 [0.56, 0.50, 1.00, 0.60, 0.57, 0.27, 0.20, 0.27, 0.40], # s3 [0.60, 0.55, 0.60, 1.00, 0.82, 0.20, 0.30, 0.40, 0.50], # m1 [0.45, 0.41, 0.57, 0.81, 1.00, 0.19, 0.30, 0.40, 0.50], # m2 [0.33, 0.33, 0.27, 0.20, 0.19, 1.00, 0.25, 0.27, 0.23], # m3 [0.23, 0.23, 0.20, 0.30, 0.30, 0.25, 1.00, 0.75, 0.60], # w1 [0.23, 0.27, 0.27, 0.40, 0.40, 0.27, 0.75, 1.00, 0.80], # w2 [0.33, 0.36, 0.40, 0.50, 0.50, 0.23, 0.60, 0.80, 1.00], # w3 );
As you can see @sim is a pairwise similarity matrix of elements in X. So for example:
sim(s1,s1) = 1 sim(m1,s2) = 0.55
The problem is how can we find a subset of @X, let's call it C(luster), such that the density and the size of that C is maximized? So the size of C can be varied, and we consider every point of X as a centroid.

By maximizing density and size I mean the member of that C should have high similarity among each other AND the member count should be as big as possible, in comparison to other C's. So the output is just 1 best C.

Can anybody suggest a simple algorithm to achieve that in Perl?

---
neversaint and everlastingly indebted.......

Replies are listed 'Best First'.
Re: Finding The Best Cluster Problem
by halley (Prior) on May 16, 2007 at 17:25 UTC
    I think a good word to search on google would be "dendrogram." A dendrogram is a graphical representation that lets you explain the best clusters visually; there are most likely some algorithms for generating dendrograms that have code or pseudocode out there for inspiration.

    In general, a dendrogram will connect the closest matches, reorder the set so they're consecutive entries in set X, and collapse these neighbors into a single composite entry and try again. Repeat until the size of set X or the distance between clusters achieves your desired threshold.

    --
    [ e d @ h a l l e y . c c ]

      Thanks a lot for the reply. Do you know if there exist any implementation of DENDOGRAM in Perl?
      I tried CPAN, doesn't seem to find any.

      ---
      neversaint and everlastingly indebted.......
        A dendrogram is a way of visualizing the results of a clustering algorithm. It's not a way to cluster in and of itself.

        There is a CPAN module that will help you - it's Algorithm::Cluster, which contains several approaches to your problem. You'll need a bit of background in cluster analysis - cluster analysis is a good starting point. Good luck!

Re: Finding The Best Cluster Problem
by BrowserUk (Patriarch) on May 16, 2007 at 09:50 UTC

    To clarify (or not :):

    1. A 'cluster' is a rectangular subset of @sim?
    2. The density of the cluster is the average of the values in the cluster?

      Eg. sum( @sim[top..bottom; left..right] ) / (bottom-top) * (right-left); (pretending that we could do 2D slices).

    Which would make the problem: Find the rectangular subset with the largest average value; and if two subsets have the same average, favour the largest subset?


    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
      Dear BrowserUK,
      A 'cluster' is a rectangular subset of @sim?
      Sorry, I'm not sure if I understand what do you mean by rectangular subset.
      A cluster C is simply a subset of X and it can have uneven size.
      The density of the cluster is the average of the values in the cluster?
      Yes, I we use "average" as measure of density.
      Which would make the problem: Find __some__ subset with the largest average value; and if two subsets have the same average, favour the largest subset?
      Almost. Say C1 has average 0.8 but only 1 member (excluding centroid) , and C2 has average 0.7 with 5 members, we would weight C2 as better than C1.

      ---
      neversaint and everlastingly indebted.......
        Say C1 has average 0.8 but only 1 member (excluding centroid) , and C2 has average 0.7 with 5 members, we would weight C2 as better than C1.

        The problem is, you aren't specifying how you are scoring that.

        1. If you go for the highest average (as I was) then 0.8*1/1 > 0.7 *5/5.
        2. If you go for the highest score, which makes 3.5 beat 0.8, then you will never remove anything from the set because it would reduce the (total) score.

        Put another way:

        • Allowing irregular subsets, the 'cluster' of 9 1.0s that form the major diagonal gives you an average of 1.0 and a size of 9. This can never be beaten, (scoring the average) as adding any lesser value than 1.0 value will decrease the average.
        • But if you use the total, you'll never remove anything from the set as any removal will reduce that total.

        Unless you introduce some other metric or heuristic, you don't have a scoring mechanism that reflects your stated goals?


        Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
        "Science is about questioning the status quo. Questioning authority".
        In the absence of evidence, opinion is indistinguishable from prejudice.

        You beat me to it, but my reading of the requirement (inevitably) produces 9 equal subsets with an size of 1 and an average of 1. D'oh!

        However, I'm now confused by how you can have a centroid of an irregularly shaped set, given discrete values?

        Actually, I think I'm completely lost by your meaning now. Could you give an example of two subsets, their relative densities, and how you calculated them?


        Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
        "Science is about questioning the status quo. Questioning authority".
        In the absence of evidence, opinion is indistinguishable from prejudice.