cutcopy11 has asked for the wisdom of the Perl Monks concerning the following question:

I am attempting to cluster a large amount of pair-wise data that represent coordinates in a symmetrical matrix where each element in the matrix represents a “similarity” value between the two data pairs. I have to cluster the data when the value is greater than 60.

Lets pretend there are 10 pieces of data (0 through 9). So, the matrix would have 100 elements. I already wrote a program that extracts the coordinates and corresponding values of the matrix when this value is greater than 60. Since the matrix is symmetrical, I only extracted values when Y > X.

X Y Value
0 1 70
0 2 80
1 2 60
1 5 90
2 3 100
6 7 90
7 8 100

So, cluster #1 will be 0,1,2,3,5.

cluster #2 will be 6,7,8

Then I will add in later …

cluster #3 will be 4

cluster #4 will be 9

I am weak on passing information to and from subroutines.

I would like for the data to be in this data structure.

$cluster_60[$cluster_element]{$cluster_key}

So,

$cluster_60[4]{1} = 5

$cluster_60[0]{3} = 4

$cluster_60[1]{2} = 8

I would like to be able to access 5 by simply writing outside of the subroutine

$cluster_60[4]{1}.

Please help me, Perl Monks! Thanks, Clayton

Replies are listed 'Best First'.
Re: Pair-Wise Clustering
by ELISHEVA (Prior) on Oct 01, 2009 at 03:54 UTC

    Please see I know what I mean. Why don't you?. It isn't at all clear from your sample data and examples what is meant by "cluster_element" and "cluster_key". I'm not at all following from your sample data why cluster #3 would be 4 (4 never appears in your sample data table).

    Pair wise clustering is an NP-complete problem. There are a variety of ways you can solve this in a reasonable amount of time for small numbers of data points (search the web), but to understand and code up those algorithms will require clear definitions of terms. It pays especially in this case to be clear to yourself and others about exactly what you are trying to do and how you are naming the various parts of your problem.

    Best, beth

      Say I have 5 sequences that are 5 nucleotides long: AAAAA, AAAAC, AAGGC, AGGGG, CCCCC, which represent elements 0 through 4. The similarity matrix would look like this:
      100 80 40 20 0
      80 100 60 20 20
      40 60 100 60 20
      20 20 60 100 0
      0 20 20 0 100

      Because there are 5 pieces of data (0 through 4), the matrix has 25 elements. I already wrote a program that extracts the coordinates and corresponding values of the matrix when this value is greater than 60. Since the matrix is symmetrical, I only extracted values when columns > rows. This is what my program extracts and puts in a text file:

      0 1 80
      1 2 60
      2 3 60

      So, there are two clusters. Cluster#1 is 0,1,2,3 and Cluster#2 is 4. Just as data for 4 and 9 are not in the table of the previous example, the data for sequence 4 in this example is not in the table above, which represents the text file that I am working from, but I still have to use these lonely elements as cluster singularities. Anyway, I am working with a huge matrix that is much more complex than this, and I have to figure out a clustering algorithm for it.

       $cluster_60[2]{0} = 2; $cluster_60[0]{1} = 4

      Thanks for your help, everyone. I am doing my best to explain my problem. This is a small part of a class project. Its due in over a week. So, I'm not stressed out about it. I don't care about my grade in this class because I am a grad student who really only cares about papers, but I get obsessed when I can't figure something out related to writing code. I use crude perl scripts to analyze biological data and might actually get a paper out of it here soon. I learned Perl over a year ago on my own, and now, my advisor makes me use it everyday and take bioinformatic classes where I have to write scripts. I am really delerious right now because I am tired. The most recent Bill Maher episode with Michael Moore, that Princeton economic nobel laureate, former NY gov Elliot Spitzer, and John Waters was hilarious. I hope that I can become a Perl Monk someday, but I still suck at references and subroutines and have very limited experience with MySQL unlike tye.

        By creating an account and participating you are, by definition, a monk, but that probably isn't what you mean. Even to become a "real monk", the only criterion is a passion for learning. Everything else, references, subroutines, and so on, will come with study and practice. Your advisor seems to be making sure you get plenty of the later. There is a rich supply of material here on this site that will help you get more comfortable with these things: both answers and questions by others tripped up by the same things tripping you up. Be sure to use Super Search to explore further.

        As for clustering algorithms, a google search would be in order, but that will only give the algorithms, not how to implement them in Perl. You will need to get comfortable with subroutines, references and data structures to finish this particular project. The algorithms you need will be very hard to implement without them - see perldata, perldsc, perlreftut, and perlsub for starters. I'm sure others will have even better suggestions. It may seem like a lot of work to climb over that learning hump in a week, but I doubt your advisor would have made you take this course if he didn't believe in you. You can do it.

        A while back (many months ago) someone posted a question on clustering algorithms in a networking context. The goal was to find the optimal subdivision of N machines into 2 clusters. A number of people posted coded solutions. Although it is a slightly different problem, you may want to look at that thread for a discussion of efficiency issues related to symmetrical matrices and examples of how to translate various clustering algorithms to code. See "Divide" challenge

        You may also want to check out the following CPAN modules

        Best, beth

        Dude, your enthusiasm is contagious, being from a bioinformatics background, however, I can't follow what you're trying to say. You plotted this similarity matrix, there're 4 pieces of data and the matrix has 25 elements is obvious from the context but what is the weight function for each of these nucleotides, since you said pairwise, which two segments are aligned against each other and which pair score is assigned to each nucleotide pairs in the combination. You "wrote a program that extracts the coordinates and corresponding values of the matrix when this value is greater than 60", show us some relevant excerpts from it.

        "the data for sequence 4 in this example is not in the table above"
        Don't you think it might be imperative that the table has representations of the sequences you've provided instead of showing us the sequences but showing a different table of perhaps different similarity scores. I tried mapping the sequences to the matrix to get the scores associated with each nucleotide but without the scoring weight I can't of course know how two nucleotides evaluation is assigned values representing the strength of the pairwise similarity checks.

        You said there're two clusters, please shed light into how these clusters are deduced. I am asking all these questions so I can help clarifying the problem because probably someone can come with a more optimal approach than the one you're following or can guide you to what the right thing to do is.

        Finally, "I am doing my best to explain my problem".
        And no doubt that we'd do our best to understand and help solve the problem each in our own capacities. Check these ( First link, recursive formula.), and maybe this one More efficient way to lookup with 2 AoA's. for some initial warm-up.

        Welcome to Perl Monks and have a happy Perl journey.

        Update: subroutines and references are easy concepts to pick, if you give yourself 10 days to sit for them., you would know another definition for the sentence "Powerful Fun".


        Excellence is an Endeavor of Persistence. Chance Favors a Prepared Mind.
        I have to figure out a clustering algorithm for it

        See cluster analysis for a very coarse introduction.

Re: Pair-Wise Clustering
by Anonymous Monk on Oct 01, 2009 at 00:37 UTC
    I would like for the data to be in this data structure...

    That doesn't explain what kind of structure you want. Create sample data structure you want, and print it with Dumper, example

    #!/usr/bin/perl -- use strict; use warnings; use Data::Dumper; my( @cluster_60 ) = ( { 3 => 4 }, { 2 => 8 }, {}, {}, { 1 => 5 }, ); $cluster_60[2]{hot} = 'dog'; print Dumper( \@cluster_60); __END__ $VAR1 = [ { '3' => 4 }, { '2' => 8 }, { 'hot' => 'dog' }, {}, { '1' => 5 } ];
    Data Types and Variables, References quick reference

      I need to write a code that determines the clusters from the table. From the table above, I would get four clusters. Cluster#1 would have 5 elements. Cluster#2 would have 3 elements. Cluster 3 and 4 would have 1 element each. I would like to stores these elements of each cluster in the form below

       $cluster_60[element]{cluster#}

      After after adding all the elements to each cluster, then the following could be obtained

       $cluster_60[0]{3} = 9

      I need some sort of hierarchial/single-linkage algorithm, but I can't find exactly what I need.

        Your word explanations aren't sinking into my brain. Is the perl data structure I showed the one you want or not?
Additional Info for Cluster Question
by cutcopy11 (Initiate) on Oct 01, 2009 at 00:06 UTC
    By the way, I am starting from a text file that is tab-delimited that is in the form of the table below. I have attempted to write the code for a couple hours, but I get stuck on the logic. I don't know if I need a recursive function or what and I wouldn't know how to make one for this if that is what is needed. -Clayton
Re: Pair-Wise Clustering
by SuicideJunkie (Vicar) on Oct 02, 2009 at 13:46 UTC

    I'm not seeing the logic behind that output datastructure.

    Some questions:

    • Why is the cluster ID in a hash, when it is a simple integer count? An array would make sense to me.
    • Why is the element before the cluster? Not all clusters will have the same number of elements, and as is, you can't separate one cluster from the rest and pass it to a sub.
    • Are there any advantages to the current order that I do not see? You can get a hash of clusters that have an Nth element, but Nth elements don't mean much since the order of elements in a cluster is arbitrary.

    The way I would do this is an AoA structure, with the indexes reversed:

    my $data = shift; my @clusters; my $elementsAlreadyUsed = []; while (my $nextCluster = findNextCluster($data, $elementsAlreadyUsed)) { push @clusters, nextCluster; } # Loop over each cluster for my $clusterID (0.. $#clusters) { print "Cluster $clusterID (#". ($clusterID +1) . ")\n"; # Print the elements contained in the cluster print "\tContains element(s): ". (join ', ', @{$clusters[$clusterID] +}) . "\n"; } sub findNextCluster { my $data = shift; my $elementsUsed = shift; # Pick first unused element to start from. # Return undef if all elements are used. # DFS or BFS over data connected to the start element; return BFSResultArrayRef($data, $startElement); }
    Thusly, $someElement = $clusters[$clusterIndex][$elementIndex];
    and $arrayRefOfAllElementsInCluster = $clusters[$clusterIndex];
    and $numElementsInCluster = scalar @{$clusters[$clusterIndex]};