in reply to Cluster a big bunch of strings

While you do start out with a large number of strings, you could certainly reduce the number of required comparisons by precomputing some cheap metrics. For example, checking for exact equality is fairly cheap. As well, assuming significant variability in title length, if you pick a fixed cutoff of a Levenshtein distance of 3, two strings whose lengths differ by 4 or more can be dropped immediately, since the edit distance can never be less than the difference in lengths.
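A minimal sketch of that prefilter (the @titles list and the cutoff of 3 are placeholders, not your data):

    use strict;
    use warnings;

    # Placeholder data; in practice these would be your journal titles.
    my @titles = ('the perl journal', 'the perl jurnal', 'an unrelated and much longer title');
    my $max_dist = 3;    # assumed fixed Levenshtein cutoff

    for my $i (0 .. $#titles) {
        for my $j ($i + 1 .. $#titles) {
            # Exact duplicates can be handled without any distance computation.
            next if $titles[$i] eq $titles[$j];

            # The edit distance is at least the difference in lengths,
            # so pairs differing by more than the cutoff can be skipped.
            next if abs(length($titles[$i]) - length($titles[$j])) > $max_dist;

            # Only now pay for the expensive Levenshtein comparison.
            # compare($titles[$i], $titles[$j]);
        }
    }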

I'm a little surprised that 1 trillion comparisons is computationally infeasible. Are you trying to do this dynamically? If so, it would seem your best bet would be caching results in a DB, thus reducing the problem to one large initial data crunch followed by a much smaller insertion operation for new additions.
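A rough sketch of that shape, assuming SQLite via DBI (the table name, schema, and clustering details are made up for illustration):

    use strict;
    use warnings;
    use DBI;

    my $dbh = DBI->connect('dbi:SQLite:dbname=titles.db', '', '',
                           { RaiseError => 1 });

    $dbh->do(q{
        CREATE TABLE IF NOT EXISTS titles (
            id      INTEGER PRIMARY KEY,
            title   TEXT NOT NULL,
            cluster INTEGER
        )
    });

    # After the initial crunch has populated the table, a new title only
    # needs to be compared against the stored ones; the full N^2 pass is
    # never redone.
    sub add_title {
        my ($new_title) = @_;
        my $rows = $dbh->selectall_arrayref(
            'SELECT id, title, cluster FROM titles');
        for my $row (@$rows) {
            my ($id, $title, $cluster) = @$row;
            # ... run the (pre-filtered) distance check against $title and
            # adopt $cluster if it matches.
        }
        $dbh->do('INSERT INTO titles (title, cluster) VALUES (?, ?)',
                 undef, $new_title, undef);
    }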

My last thought is that the probability of a typographical error is roughly proportional to string length, so you may want to use a relative distance cutoff rather than an absolute one.
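For instance, with Text::Levenshtein (assuming that module is available; the 10% cutoff is an arbitrary choice):

    use strict;
    use warnings;
    use Text::Levenshtein qw(distance);

    # Treat two titles as "close" when the edit distance is small relative
    # to the longer title, instead of using one fixed absolute cutoff.
    sub is_close {
        my ($s, $t, $ratio) = @_;
        $ratio //= 0.10;    # assumed 10% cutoff
        my $longer = length($s) > length($t) ? length($s) : length($t);
        return 0 unless $longer;
        return distance($s, $t) / $longer <= $ratio;
    }

    print is_close('journal of chemical physics', 'journal of chemicl physics')
        ? "close\n" : "far\n";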

Re^2: Cluster a big bunch of strings
by educated_foo (Vicar) on Mar 30, 2009 at 17:17 UTC
    I'm a little surprised that 1 trillion comparisons is computationally infeasible.
    Why? 1 GHz is one billion cycles per second, so 1 trillion CPU instructions will take about 1,000 seconds, or roughly 17 minutes. Given that a string comparison with multiple function calls will take at least hundreds or thousands of CPU cycles, this will take at least a day or so. Probably a lot more.

    If you want to do fast, approximate comparison of strings, a lot of work has been done on that in bioinformatics. In particular, the popular BLAST and PSI-BLAST algorithms more or less solve this problem by creating an index of short substrings, then doing a full comparison where they match. You might try to develop something similar.
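    A rough sketch of that index-then-verify idea (the substring length of 5 and the sample data are arbitrary choices, and this is only loosely inspired by BLAST, not an implementation of it):

        use strict;
        use warnings;

        my @strings = ('the perl journal', 'the perl jurnal', 'the linux journal');
        my $k = 5;    # assumed substring (k-mer) length

        # Index each k-character substring to the strings containing it.
        my %index;
        for my $i (0 .. $#strings) {
            my %seen;
            for my $pos (0 .. length($strings[$i]) - $k) {
                my $kmer = substr($strings[$i], $pos, $k);
                push @{ $index{$kmer} }, $i unless $seen{$kmer}++;
            }
        }

        # Only pairs sharing at least one substring become candidates
        # for the full, expensive comparison.
        my %candidate;
        for my $ids (values %index) {
            for my $p (0 .. $#$ids) {
                for my $q ($p + 1 .. $#$ids) {
                    $candidate{"$ids->[$p] $ids->[$q]"} = 1;
                }
            }
        }

        for my $pair (keys %candidate) {
            my ($i, $j) = split ' ', $pair;
            # ... run the full distance computation on $strings[$i] and $strings[$j]
        }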

      I come from a scientific computing background, so letting a machine chew on some code for a month is not computationally infeasible to me. Obviously this is unacceptable if this is to be done on the fly (or if the deadline is next week), but doing this in a one-off fashion and storing the results doesn't seem out there to me.

      To get some scale, the following code randomly generates a set of strings and then runs them through the equality operator. The strings vary in length and content, but to reduce early exits I only use the letters a and b. It takes around 60 seconds to run as posted. Assuming N^2 scaling, this would take just short of 6 days on my laptop (2.4 GHz). Of course, given the size of the problem, there'll be some slowdown due to memory access, but there's certainly plenty of room for optimization in what I posted.
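      Something along these lines (a reconstruction of the kind of benchmark described, not the code as posted; the count of 20,000 strings and the 50-100 character length range are guesses):

          use strict;
          use warnings;

          # Random strings of varying length over only 'a' and 'b', so that
          # eq does not bail out on the first character as often as it
          # would with a full alphabet.
          my @strings;
          for (1 .. 20_000) {    # string count is a guess, not the posted value
              my $len = 50 + int rand 51;    # lengths 50..100
              push @strings, join '', map { rand() < 0.5 ? 'a' : 'b' } 1 .. $len;
          }

          # Compare every pair with the equality operator and time it.
          my $start = time;
          my $equal = 0;
          for my $i (0 .. $#strings) {
              for my $j ($i + 1 .. $#strings) {
                  $equal++ if $strings[$i] eq $strings[$j];
              }
          }
          printf "%d equal pairs in %d seconds\n", $equal, time - $start;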

      Update: For those who are wondering, the worst-case scenario for the above (all strings set to 'a' x 100) takes 75 seconds on my machine. And comparing just the lengths instead of the full strings takes roughly the same time in all cases.