in reply to Re: Pair-Wise Clustering
in thread Pair-Wise Clustering

Say I have 5 sequences that are 5 nucleotides long: AAAAA, AAAAC, AAGGC, AGGGG, CCCCC, which represent elements 0 through 4. The similarity matrix would look like this:
100 80 40 20 0
80 100 60 20 20
40 60 100 60 20
20 20 60 100 0
0 20 20 0 100

Because there are 5 pieces of data (0 through 4), the matrix has 25 elements. I already wrote a program that extracts the coordinates and corresponding values of the matrix when this value is greater than 60. Since the matrix is symmetrical, I only extracted values when columns > rows. This is what my program extracts and puts in a text file:

0 1 80
1 2 60
2 3 60

So, there are two clusters. Cluster#1 is 0,1,2,3 and Cluster#2 is 4. Just as data for 4 and 9 are not in the table of the previous example, the data for sequence 4 in this example is not in the table above, which represents the text file that I am working from, but I still have to use these lonely elements as cluster singularities. Anyway, I am working with a huge matrix that is much more complex than this, and I have to figure out a clustering algorithm for it.

 $cluster_60[2]{0} = 2; $cluster_60[0]{1} = 4

Thanks for your help, everyone. I am doing my best to explain my problem. This is a small part of a class project. Its due in over a week. So, I'm not stressed out about it. I don't care about my grade in this class because I am a grad student who really only cares about papers, but I get obsessed when I can't figure something out related to writing code. I use crude perl scripts to analyze biological data and might actually get a paper out of it here soon. I learned Perl over a year ago on my own, and now, my advisor makes me use it everyday and take bioinformatic classes where I have to write scripts. I am really delerious right now because I am tired. The most recent Bill Maher episode with Michael Moore, that Princeton economic nobel laureate, former NY gov Elliot Spitzer, and John Waters was hilarious. I hope that I can become a Perl Monk someday, but I still suck at references and subroutines and have very limited experience with MySQL unlike tye.

Replies are listed 'Best First'.
Re^3: Pair-Wise Clustering
by ELISHEVA (Prior) on Oct 01, 2009 at 05:29 UTC

    By creating an account and participating you are, by definition, a monk, but that probably isn't what you mean. Even to become a "real monk", the only criterion is a passion for learning. Everything else, references, subroutines, and so on, will come with study and practice. Your advisor seems to be making sure you get plenty of the later. There is a rich supply of material here on this site that will help you get more comfortable with these things: both answers and questions by others tripped up by the same things tripping you up. Be sure to use Super Search to explore further.

    As for clustering algorithms, a google search would be in order, but that will only give the algorithms, not how to implement them in Perl. You will need to get comfortable with subroutines, references and data structures to finish this particular project. The algorithms you need will be very hard to implement without them - see perldata, perldsc, perlreftut, and perlsub for starters. I'm sure others will have even better suggestions. It may seem like a lot of work to climb over that learning hump in a week, but I doubt your advisor would have made you take this course if he didn't believe in you. You can do it.

    A while back (many months ago) someone posted a question on clustering algorithms in a networking context. The goal was to find the optimal subdivision of N machines into 2 clusters. A number of people posted coded solutions. Although it is a slightly different problem, you may want to look at that thread for a discussion of efficiency issues related to symmetrical matrices and examples of how to translate various clustering algorithms to code. See "Divide" challenge

    You may also want to check out the following CPAN modules

    Best, beth

Re^3: Pair-Wise Clustering
by biohisham (Priest) on Oct 01, 2009 at 13:28 UTC
    Dude, your enthusiasm is contagious, being from a bioinformatics background, however, I can't follow what you're trying to say. You plotted this similarity matrix, there're 4 pieces of data and the matrix has 25 elements is obvious from the context but what is the weight function for each of these nucleotides, since you said pairwise, which two segments are aligned against each other and which pair score is assigned to each nucleotide pairs in the combination. You "wrote a program that extracts the coordinates and corresponding values of the matrix when this value is greater than 60", show us some relevant excerpts from it.

    "the data for sequence 4 in this example is not in the table above"
    Don't you think it might be imperative that the table has representations of the sequences you've provided instead of showing us the sequences but showing a different table of perhaps different similarity scores. I tried mapping the sequences to the matrix to get the scores associated with each nucleotide but without the scoring weight I can't of course know how two nucleotides evaluation is assigned values representing the strength of the pairwise similarity checks.

    You said there're two clusters, please shed light into how these clusters are deduced. I am asking all these questions so I can help clarifying the problem because probably someone can come with a more optimal approach than the one you're following or can guide you to what the right thing to do is.

    Finally, "I am doing my best to explain my problem".
    And no doubt that we'd do our best to understand and help solve the problem each in our own capacities. Check these ( First link, recursive formula.), and maybe this one More efficient way to lookup with 2 AoA's. for some initial warm-up.

    Welcome to Perl Monks and have a happy Perl journey.

    Update: subroutines and references are easy concepts to pick, if you give yourself 10 days to sit for them., you would know another definition for the sentence "Powerful Fun".


    Excellence is an Endeavor of Persistence. Chance Favors a Prepared Mind.

      The coordinates and corresponding values appear to be the row and column numbers. The 0, 1, 80 in his chart would mean row 0, column 1, value is 80.

      The clusters sound like a list of all the separate connected subgraphs. Where there are 5 nodes in the example, and a pair of nodes are connected if(M,N)>=60 in the chart.

      0 is connected to 1, 1 is connected to 2, and 2 to 3. 4 is connected to nothing, so you've got two independent sets of nodes; 0,1,2,3 vs 4.

Re^3: Pair-Wise Clustering
by Anonymous Monk on Oct 01, 2009 at 05:23 UTC
    I have to figure out a clustering algorithm for it

    See cluster analysis for a very coarse introduction.