The obvious idea of using R^^2 as the range for the hashing function, doesn't inspire anything obvious because the similarity function clearly isn't associative across three strings as someone already freaked out about, therefore either more than two dimensions are needed, or a more esoteric space (a little googling against "similarity hashing" suggests that BOTH have been applied ie. fractal geometries and multi-dimensional metric spaces in the mix). In the similar problem of clustering species by similarity of genes, one approach was to use dissimilarity to construct a hierarchy of categories and have the actual data "find each other's similarity" further down the tree (see http://math.berkeley.edu/~datta/stanfordtalk.pdf). All of the work is recent (less than a decade old) and the most promising has been applied to "wikipedia in the pocket", based on this article. wikipedia itself only mentions the algorithm you already have, a.k.a the Rabin-Karp algorithm, which looks rather clunky compared to the more whizz-bang-maths efforts to pre-process a single string into its correct location in a metric space (do leave us a module on your way out ;))
^M Free your mind!
In reply to Re: Fast string similarity method
by Moron
in thread Fast string similarity method
by icanwin
| For: | Use: | ||
| & | & | ||
| < | < | ||
| > | > | ||
| [ | [ | ||
| ] | ] |