Re^3: Cluster a big bunch of strings

I come from a scientific computing background, so letting a machine chew on some code for a month is not computationally infeasible to me. Obviously this is unacceptable if this is to be done on the fly (or if the deadline is next week), but doing this in a one-off fashion and storing the results doesn't seem out there to me.

To get some scale, the following code randomly generates a set of strings and then runs them through the equality operator. The strings vary in length and content, but to reduce early exits, I only use letters a and b. It takes around 60 seconds to run as posted. Assuming N^2 scaling, this would take short of 6 days on my laptop (2.4 GHz). Of course, given the size of the problem, there'll be some slow down due to memory access, but there's certainly plenty of room for optimization in what I posted.

use strict;
use warnings;

my $start_time = time;
my $n_strings = 10_000;
my $min_len_strings = 20;
my $max_len_strings = 100;

# Generate strings
my @number2letter = ('a' .. 'b');
my @strings = ();
undef $strings[$n_strings];
foreach my $string (@strings){
    $string = '';
    for (1 .. $min_len_strings + rand($max_len_strings - $min_len_stri
+ngs)) {
        $string .= $number2letter[int rand @number2letter];
    }
}

# Build comparison matrix
my @matches = ();
undef $matches[$n_strings];
foreach my $match(@matches) {
    $match = [];
    undef $match->[$n_strings];
}

# And compare
for my $i (0 .. $#strings) {
    for my $j (0 .. $#strings) {
        $matches[$i][$j] = $strings[$i] eq $strings[$j] ? 1 : 0;
    }    
}

# And finish
my $tot_matches = 0;
for my $i (0 .. $#strings) {
    for my $j (0 .. $#strings) {
        $tot_matches++ if $matches[$i][$j];
    }    
}
my $time = time - $start_time;
print "I ran for $time seconds, and found $tot_matches matches.";
[download]

Update: For those who are wondering, worst case scenario for the above (all strings given by 'a' x 100) takes 75 seconds on my machine. And just considering lengths is roughly comparable in all cases.

Comment on Re^3: Cluster a big bunch of strings Select or Download Code