
I think the place where you could really get some performance enhancements is in the outer loop. From the sounds of things, right now you are comparing every element in your set to every other element in the set (O(n^2)). With a large dataset like yours, that will be slow no matter how fast the difference algorithm is.

If you could use some sort of heuristic to divide the data into subsets, you could then compare only the values within each subset to each other. Maybe only compare every string to the other strings in the set that are within N characters of its length, i.e. if your base string is 10 characters long and you have 2 characters of slack, then only compare your string to the 8 through 12 character long strings in the original dataset.
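
Untested, but a length-bucketing pass might look something like this. Here compare() stands in for whatever difference routine you already have, and @strings and $slack are made-up names for your dataset and length tolerance:

    use strict;
    use warnings;

    # Toy data; compare() and $slack are stand-ins for your real
    # difference routine and length tolerance.
    my @strings = qw(foo food foobar quux quuux);
    my $slack   = 2;

    # Bucket strings by length so each string is only compared to
    # strings within $slack characters of its own length.
    my %by_len;
    push @{ $by_len{ length $_ } }, $_ for @strings;

    for my $str (@strings) {
        my $len = length $str;
        for my $l ( $len - $slack .. $len + $slack ) {
            next unless exists $by_len{$l};   # no bucket at this length
            for my $other ( @{ $by_len{$l} } ) {
                next if $other eq $str;       # skip self-comparison
                compare( $str, $other );      # your existing diff code
            }
        }
    }

    sub compare { }    # placeholder for your difference algorithm

If the lengths are spread out at all evenly, each string only sees a small slice of the dataset instead of all of it.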

Is the data you are analyzing really arbitrary "n-char strings", or is it something more structured like English text? If it is English text you could use something like Text::Metaphone or Text::Soundex to start off, and only compare things that have similar soundex (or metaphone) values. If your data is some other type with a semi-regular structure (names, mailing addresses, e-mail addresses, etc.) there are other good heuristics you could use to avoid the BFI cartesian-product comparison you are doing now.
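
For example (again untested; compare() is a stand-in as before, and the word list is invented):

    use strict;
    use warnings;
    use Text::Soundex;    # from CPAN

    my @words = qw(smith smyth smithe jones johns johnson);

    # Group words by soundex code so only similar-sounding words
    # get handed to the expensive comparison.
    my %by_code;
    for my $w (@words) {
        my $code = soundex($w);
        next unless defined $code;   # soundex() is undef for non-alpha input
        push @{ $by_code{$code} }, $w;
    }

    # Run the real comparison only within each bucket.
    for my $bucket ( values %by_code ) {
        for my $i ( 0 .. $#$bucket - 1 ) {
            for my $j ( $i + 1 .. $#$bucket ) {
                compare( $bucket->[$i], $bucket->[$j] );
            }
        }
    }

    sub compare { }    # placeholder for your difference algorithm

Text::Metaphone works the same way, just with Metaphone() producing the bucket key instead of soundex().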

While I'm thinking about this: are you loading the whole dataset into memory and processing it all at once? If not, and you have the memory to load the whole dataset into RAM, then doing so will give you a significant performance increase.
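
Something along these lines, assuming one string per line in a file ('dataset.txt' is a made-up name):

    use strict;
    use warnings;

    # Slurp the whole dataset into an array once, up front, so the
    # comparison loops never touch the disk.
    open my $fh, '<', 'dataset.txt' or die "Can't open dataset.txt: $!";
    chomp( my @strings = <$fh> );
    close $fh;

    # @strings now lives entirely in RAM.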