in reply to Calculating "similarity"

( Correction: the "Vector Space..." article is not directly related to the OP's problem, but it does deal with the "closeness of words" concept. With that in mind, below is the corrected version of my earlier reply.

UPDATE (Mar 3 2003): If anybody is still interested, I have rounded up some relevant things under "string munging". )

See the Vector Space Search Engine article, which does some similar things.

The String::Similarity and String::Approx modules may also be of interest. Below are the descriptions (from FreeBSD ports)...

String::Similarity
String::Similarity calculates the similarity index of its two arguments. A value of '0' means that the strings are entirely different. A value of '1' means that the strings are identical. Everything else lies between 0 and 1 and describes the amount of similarity between the strings.
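
To make the 0-to-1 index concrete, here is a minimal sketch in Python using the standard library's difflib. It is only a stand-in for illustration: I am not claiming it is the algorithm String::Similarity uses, and the function name similarity() is my own choice here, so the exact numbers may differ from what the Perl module returns.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Return a similarity index in [0, 1] for two strings.

    Stand-in only: this uses difflib's ratio, which is
    2 * (matched characters) / (total characters in both strings).
    It is not claimed to match String::Similarity's results.
    """
    return SequenceMatcher(None, a, b).ratio()


# Identical strings score 1, strings with nothing in common score 0,
# and everything else lands somewhere in between.
for left, right in [("color", "color"), ("colour", "color"), ("abc", "xyz")]:
    print(f"{left!r} vs {right!r}: {similarity(left, right):.3f}")
```

The point is just the shape of the result: one number per pair of strings, which you can then compare against whatever threshold suits your data.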

String::Approx
String::Approx lets you match and substitute strings approximately. With this you can emulate errors: typing errors, spelling errors, closely related vocabularies (colour vs. color), genetic mutations (GAG vs. ACT), abbreviations (McScot vs. MacScot).
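
The idea behind approximate matching can be sketched with a plain edit-distance (Levenshtein) count: how many single-character insertions, deletions, or substitutions turn one string into the other. The Python below is a rough illustration under my own assumptions, not String::Approx's implementation. In particular, the names edit_distance() and approx_equal() are hypothetical, and this version compares whole strings, whereas the module matches a pattern approximately inside its input.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum number of single-character
    insertions, deletions, and substitutions to turn a into b."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]  # distance from a[:i] to the empty string
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute (or keep)
        prev = curr
    return prev[-1]


def approx_equal(a: str, b: str, max_edits: int = 1) -> bool:
    """Hypothetical helper: treat two strings as a match when they
    are within max_edits edits of each other."""
    return edit_distance(a, b) <= max_edits


for left, right in [("colour", "color"), ("McScot", "MacScot"), ("GAG", "ACT")]:
    print(left, right, edit_distance(left, right), approx_equal(left, right))
```

With a tolerance of one edit, the spelling and abbreviation pairs count as matches, while a pair that differs in every position does not. Raising max_edits loosens the match.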