comment on

Bioinformaticians deal a lot with a very similar kind of problem. There's this relatively new family of techniques for doing high-throughput experiments that is based on so-called "microarrays". You can think of it is miniaturized labs; each microarray is a matrix of experiments. For example, the rows can correspond to genes, and the columns can corresponds to different patients in a longitudinal study. But there are a zillion variations for what the rows and columns can represent. What they all have in common is that, at the end of the day one has a whole bunch of matrices of numbers (each cell in the matrix corresponds to the measurement yielded by that cell's micro-experiment), and one needs to somehow make sense of it, and the first thing people want to do is exactly what you want to do: come up with some way reorder the rows and/or columns of the array so that "similar" rows and/or columns are close to each other. This reordering process is called "clustering." Here's a picture of a typical microarray, and figure 3 of this paper illustrates the results of a clustering algorithm.

As you can see, the main difference between your problem and what the biologists are dealing with is that your matrix consists of only 1s and 0s, whereas theirs usually hold continuously varying values. I doubt that the clustering methods would differ significant between these two variants of the problem. BTW, the paper I linked to above uses a relatively simple clustering algorithm that would not be too hard to implement in Perl (though it'd probably be too slow). Their implementation is in C++, and it's available here; it's called Cluster, and only a Windows version is available. (It also comes with an unintentionally hilarious license agreement, or at least it used to, whose terms included the condition that forbid licensees for commenting on the quality of the code.)

I looked around for some turnkey implementation of one such clustering algorithm but I could not find anything that was sufficiently streamlined and accessible that it could be easily adapted to your problem. But this is not my area of expertise; maybe some of the bioinformatic-savvy monks will give you some more better pointers.

The other issue is that all the implementations of this sort of stuff that I've seen in Perl usually require something like PDL, because these algorithms tend to rely heavily on vector and matrix computation, and out-of-the-box Perl is just too slow for it (at least for matrices with hundreds of rows and columns).

I'll ask around some more. What's your OS?

the lowliest monk

In reply to Re: Deducing Ideal Groups by tlm
in thread Deducing Ideal Groups by pboin

Posts are HTML formatted. Put <p> </p> tags around your paragraphs. Put <code> </code> tags around your code and data!

Titles consisting of a single word are discouraged, and in most cases are disallowed outright.

Read Where should I post X? if you're not absolutely sure you're posting in the right place.

Please read these before you post! —

Posts may use any of the Perl Monks Approved HTML tags:

a, abbr, b, big, blockquote, br, caption, center, col, colgroup, dd, del, details, div, dl, dt, em, font, h1, h2, h3, h4, h5, h6, hr, i, ins, li, ol, p, pre, readmore, small, span, spoiler, strike, strong, sub, summary, sup, table, tbody, td, tfoot, th, thead, tr, tt, u, ul, wbr

You may need to use entities for some characters, as follows. (Exception: Within code tags, you can put the characters literally.)

	For:		Use:
	&		`&`
	<		`<`
	>		`>`
	[		`[`
	]		`]`

Link using PerlMonks shortcuts! What shortcuts can I use for linking?

See Writeup Formatting Tips and other pages linked from there for more info.