Re: Fast string similarity method

I think your speed problem is a result the fact that your method is O(N**2), as you compare every string with every other string.

I wonder if it's not possible to use a form of "stemming", i.e. reduce the strings to a bare, basic form (the stem), and compare those instead. In the simple case where things are similar if and only if their stem is the same (think of SoundEx), you should be reducing the algorithm order to O(N). You may then expect a program speedup, for that range of number, of several orders of magnitude.

Even when the case is not so simple, it might still be a helpful approach. Perhaps you can reduce the stems even more, for items that are only remotely similar, and extract a second grade similarity that way?

Comment on Re: Fast string similarity method