in reply to Fuzzy matching of text strings

I'm not sure about other databases, but Microsft SQL Server has a function called SOUNDEX, which returns a "similarity" string that you can use for comparisons.

e.g.
SELECT description, SOUNDEX(description) AS SOUNDEX FROM mytable description SOUNDEX -------------------------------------------- ------- Promessa High Spirits P652 Promessa high Spirits P652 Promessa Hgh Spirits P652 Promessa high spirits P652 And now for something completely different A530 And now, for something completely different A530 And now! Something completely different A530 Something completely different now S535
Description of SOUNDEX from manual:
SOUNDEX converts an alpha string to a four-character code to find similar-sounding words or names. The first character of the code is the first character of character_expression and the second through fourth characters of the code are numbers. Vowels in character_expression are ignored unless they are the first letter of the string. String functions can be nested.

Replies are listed 'Best First'.
Re^2: Fuzzy matching of text strings
by buttroast (Scribe) on Dec 14, 2005 at 19:22 UTC
    Soundex is a great tool, but in this case it is not doing anything. The reason the first four descriptions in your sample return the same soundex code is because they only processed the "Promess" portion of each record.

    Basically:

    1. Grab the first letter:

    String: Promessa H...
    Soundex: P

    2. Remove all vowels in remaining string:

    String: rmssH
    Soundex: P

    3. Condense duplicate letters:


    String: rmsH
    Soundex: P

    4. Assign 3 digits from l-r based on following key:

    1. b,p,f,v
    2. c,s,k,g,i,q,x,z
    3. d,t
    4. l
    5. m,n
    6. r

    String: rmsH
    Soundex: P6 (6 is for r)

    String: msH
    Soundex: P65 (5 is for m)

    String: sH
    Soundex: P652 (2 is for s)

    DONE AT 3 DIGITS!!! GO NO FURTHER.

    If there are consecutive characters from the same group, such as in the name "Duck", (c and k are both in group 2), the resulting soundex would be D200 (zeros are added to pad right if we run out of letters to change to numbers).

    In summary, soundex is not appropriate for longer strings comparison. If you use it, the following would all be grouped as P652:

    Promessa National Bank
    Promessing Fertilizer Company
    Promessa High Spirits
    Promessing With Me

    Hope this clears up Soundex for everyone.

    Thanks buttroast
Re^2: Fuzzy matching of text strings
by GrandFather (Saint) on Dec 15, 2005 at 03:08 UTC

    Soundex was designed for matching names and is not very usefull as a general tool. Abbreviate english words may help generating a "canonical" form of a phrase.


    DWIM is Perl's answer to Gödel