in reply to Closest matches from string array

Soundex is probably your kiddy for this, but you might get better results if you stripped common prefixes like Mc, Mac, O' and maybe a few others. That would probably allow you to tighten the match and still not miss likely hits.

You said this was coming from a DB. You ought to consider storing the Soundex values within the table as an alternate key. That would save regenerating the Soundex each time, and allow you to use SQL to only pull the likely candidates, rather than pulling them all and doing the selection yourself.


Examine what is said, not who speaks.
"Efficiency is intelligent laziness." -David Dunham
"Think for yourself!" - Abigail
Hooray!

Replies are listed 'Best First'.
Re: Re: Closest matches from string array
by tachyon (Chancellor) on Oct 28, 2003 at 11:36 UTC

    Aye laddie, the heathen Scots ave a lot te answer fur. 'Mace' =~ s/ma?c/i :-) Oi who is 'e'

    cheers

    tachyon

    s&&rsenoyhcatreve&&&s&n.+t&"$'$`$\"$\&"&ee&&y&srve&&d&&print

      I was thinking of something along the lines of

      $soundex{ $name } = soundex($1) if $name =~ m[^(?:O'|Mac|Mc)([A-Z].*$)];

      Which would only work if the list is nicely capitalised, which going by the few BazB posted, it seems that his data might be. The idea was that a search for Connor or Keefe, would also find O'Connor and O'Keefe, which using unassisted soundex wouldn't find.

      #! perl -slw use strict; use Text::Soundex; printf "%20s : %s : %s\n", $_, m[^(?:O'|Mac|Mc)([A-Z].*$)] ? soundex($1) : soundex($_), so +undex($_) for qw[ Connor O'Connor Keefe O'Keefe MacDonald McDonald Donald Donaldson O'Donnell Donagal O'Dona +gal ]; __END__ P:\test>test Connor : C560 : C560 O'Connor : C560 : O256 Keefe : K100 : K100 O'Keefe : K100 : O210 MacDonald : D543 : M235 McDonald : D543 : M235 Donald : D543 : D543 Donaldson : D543 : D543 O'Donnell : D540 : O354 Donagal : D524 : D524 O'Donagal : D524 : O352

      The first column of soundex codes are the assisted ones, the second unassisted. You can see what a difference it makes.

      That said. It would screw up sound alikes Magee and MacGee, so it may not be such a good idea. Another possibility would be to store two codes for some names, but then you move into needing to normalise the soundex codes into another table. Which wouldn't be a bad thing, but does add complication.


      Examine what is said, not who speaks.
      "Efficiency is intelligent laziness." -David Dunham
      "Think for yourself!" - Abigail
      Hooray!

        Just stirring you along a bit. 'Mace' and 'Mack' 'Macca' are just about all the McEdge cases.

        If you look at the algorithm you will see it is just this:

        ($f) = /^(.)/; tr/AEHIOUWYBFPVCGJKQSXZDTLMNR/00000000111122222222334556/; ($fc) = /^(.)/; s/^$fc+//; tr///cs; tr/0//d; $_ = $f . $_ . '000'; s/^(.{4}).*/$1/;

        While there is no doubt it gives useful results it is exceptionally simplistic. The problem you are trying to solve is threefold:

        1. Simple mis-spelling
        2. Pronuncitation Differences
        3. Dropped sylables

        To a large extent the Knuth algorithm deals reasonably successfully with 1 and 2. Where it falls over is when there is either a mis-spelling of the first letter or a dropped sylable. As you could see in the example I fuzzed the matches by dropping symbols from the code to deal with this, given that there are only 6 numeric codes in use my estimate that it would on average pull 1% of the DB should have read more like 3% if you shorten the codes.

        In my picture of a 'better' algorithm I imagine a mapping to perhaps [A-Z]{4} so you could perhaps do linear displacement rather than just dropping.....Oh well back to the real work Bayes filters are the soup de jour.

        cheers

        tachyon

        s&&rsenoyhcatreve&&&s&n.+t&"$'$`$\"$\&"&ee&&y&srve&&d&&print