in reply to Challenge: Find median string in a list

The algorithm you have is more than adequate. Putting back in the remaining optimization and fixing the >'s gives:
#!/usr/bin/perl use strict; use warnings; use Text::LevenshteinXS; use constant WORD => 0; use constant LEN => 1; use constant MAX => 10_000_000; my @word; while (<DATA>) { tr/\r\n//d; push @word, [$_, 0]; } @word = sort {$a->[WORD] cmp $b->[WORD]} @word; my ($min_word, $min_dist) = ('', MAX); for my $i (0 .. $#word) { next if $word[$i][LEN] >= $min_dist; for my $j (0 .. $#word) { my $dist = distance($word[$i][WORD], $word[$j][WORD]); $word[$i][LEN] = $dist if $dist > $word[$i][LEN]; $word[$j][LEN] = $dist if $dist > $word[$j][LEN]; last if $dist >= $min_dist; } ($min_word, $min_dist) = ($word[$i][WORD], $word[$i][LEN]) if $wor +d[ +$i][LEN] < $min_dist; } print "$min_word\t$min_dist\n"; __DATA__
Thinking about ways to speed it up, it seemed to me that the inner loop should first compare against words suspected to be far from the center, so as to break out as early as possible. I thought about picking random combinations of words and saving the pairs with the greatest distance. Or picking a word, finding the word with the greatest distance from it, finding the word with the greatest distance from that, repeat, then using the last pair of words to test against first in the inner loop. But it turns out that, with the optimizations you had both in, sorting the wordlist by decreasing word length gets the number of distance() calls down to just 86006 on your sample data with 14300 words, in a very short period of time:
@word = sort {length($b->[WORD]) <=> length($a->[WORD])} @word;

Replies are listed 'Best First'.
Re^2: Challenge: Find median string in a list
by Limbic~Region (Chancellor) on Jul 05, 2007 at 01:26 UTC
    ysth,
    At one time or another, I had tried all the tweaks you had provided and wrote them off as micro-optimizations. Combined together they really do give impressive results. Thanks for suggesting I have a second look.

    Cheers - L~R