in reply to Fast string similarity method
A couple of additional general thoughts.
Beyond optimizing the similarity computations, it's probably also worth rethinking your clustering algorithm - in particular if you can't tell a priori what your clusters are (i.e. you can't name a prototypical, representative item for each cluster), but need to extract/estimate them from the data at hand.
Unfortunately - as eric256 tried to hint at - on closer inspection, even seemingly trivial clustering problems can turn out to be much more involved than most people would expect at first. Also, doing it properly is often computationally prohibitive for large input matrices. IOW, the intricacies of the clustering itself are often underestimated. Of course, YMMV and such, but there are many different approaches, and picking the one best suited for the purpose isn't always easy, even for the initiated.
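To illustrate how quickly even the "obvious" approach gets tricky, here's a minimal Perl sketch of greedy single-linkage clustering over pairwise similarities. The similarity() routine is just a placeholder (word overlap) - substitute whatever metric you're actually using:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Placeholder similarity: fraction of $y's words also found in $x.
# Swap in whatever metric you actually use (String::Similarity,
# Text::Levenshtein, ...).
sub similarity {
    my ($x, $y) = @_;
    my %seen;
    $seen{$_}++ for split ' ', lc $x;
    my @words = split ' ', lc $y;
    my $hits  = grep { $seen{$_} } @words;
    return @words ? $hits / @words : 0;
}

# Greedy single-linkage clustering: an item joins the first cluster
# that contains at least one sufficiently similar member.
sub cluster {
    my ($threshold, @items) = @_;
    my @clusters;
    ITEM: for my $item (@items) {
        for my $c (@clusters) {
            if ( grep { similarity($item, $_) >= $threshold } @$c ) {
                push @$c, $item;
                next ITEM;
            }
        }
        push @clusters, [$item];    # no match: start a new cluster
    }
    return @clusters;
}

my @headlines = (
    'Stocks rally after Fed decision',
    'Fed decision sparks stock rally',
    'Local team wins championship',
);
for my $c ( cluster(0.5, @headlines) ) {
    print join(' | ', @$c), "\n";
}
```

Note the order-dependence: feed the items in a different order and you may well end up with different clusters - exactly the kind of subtlety that's easy to underestimate.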
However, the problem domain isn't exactly new, and many books and scientific papers have been written on the subject. For example, the social sciences and biosciences are prominent areas of application, which have driven a lot of research into, and development of, the related methodologies. (Having studied psychology at some point in my life, I've inevitably had some exposure to the topic...)
What I want to say is: in case you're more than "just playing" with the issue, and are serious about getting decent results, find yourself some introductory texts and familiarise yourself with the basic clustering approaches. (Sorry, off the top of my head, I can't think of anything appropriate to link to, but googling should turn up some online resources, or at least pointers to hardcopy publications.) Anyhow, it's certainly a good idea to learn from what others have already done - at the very least, you'd waste less time reinventing the wheel... :)
For the mathematically inclined, I'd like to mention a paper which describes an efficient clustering technique for huge input matrices. The authors suggest using Singular Value Decomposition (a well-established "tool" from linear algebra) to arrive at a solution for what they call the "generalised (continuous) clustering problem" (as opposed to the more conventional "discrete clustering problem" - see the paper for the differences). More importantly, though, they show how using a random subset of the input data can drastically cut down on the computational burden while still yielding a very good estimate of the final cluster solution for the full data set. As an example application, the paper briefly discusses Latent Semantic Indexing (a topic which might in itself offer an alternative perspective on your problem - even if only remotely related to clustering headlines...).
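To give a flavour of the formulation (reconstructed from memory and in my own notation - check the paper for the precise statement): let A be the n-by-d data matrix whose rows a_i are the items to cluster. Then, roughly:

```latex
% Discrete clustering: pick k centers c_1, ..., c_k minimizing
\[
  \sum_{i=1}^{n} \; \min_{1 \le j \le k} \, \lVert a_i - c_j \rVert^2 .
\]

% Continuous relaxation: approximate A by the best matrix of rank at most k,
\[
  \min_{\operatorname{rank}(\hat{A}) \le k} \lVert A - \hat{A} \rVert_F^2 ,
\]

% which the truncated SVD solves exactly (Eckart-Young):
\[
  A_k \;=\; \sum_{i=1}^{k} \sigma_i \, u_i v_i^{\mathsf{T}} .
\]
```

The sampling trick then boils down to estimating the top singular vectors from a random subset of the rows only, which is what makes the whole thing tractable for really large matrices.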
Hope this helps (rather than discourages :)
Cheers,
Almut
BTW, instead of implementing any more advanced algorithms yourself from scratch, you probably want to look into something like R, for which there is also a Perl binding (at version 0.02 - I've never used it myself, though).
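For instance, handing the clustering itself over to R's built-in hclust() might look roughly like this - a sketch only, assuming the Statistics::R module with its startR/send/read interface (untested; the API may differ in whatever version you end up installing):

```perl
use strict;
use warnings;
use Statistics::R;    # the Perl-R bridge (interface may vary by version)

my $R = Statistics::R->new();
$R->startR;

# Let R do the heavy lifting: hierarchical clustering on a toy matrix.
$R->send(q{
    m  <- matrix(runif(100), nrow = 10)  # stand-in for your feature data
    hc <- hclust(dist(m))                # agglomerative clustering
    k  <- cutree(hc, k = 3)              # cut the tree into 3 clusters
    print(k)                             # one cluster label per row
});
print $R->read, "\n";

$R->stopR;
```

That way you get well-tested implementations of the standard algorithms (plus dendrogram plotting etc.) essentially for free.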