in reply to Closest matches from string array
Cause it's fun (aka I like fuzzy logic).....
    use Text::Soundex;

    my @names = qw( McGee MacGee Magee MacGeady Mackintosh McIntosh
                    Griffin Griffith Griffis Griffey Grifferty
                    McGrifferty O'Griffey O'Griffin );

    my %hash;
    $hash{$_} = soundex($_) for @names;
    printf "%-15s => %s\n", $_, $hash{$_} for sort keys %hash;

    my @tests = qw( Griffin McGee McGinley Smith );

    for my $name( @tests ) {
        my $soundex = soundex($name);
        # you can make the search fuzzy in different ways.....
        my $bit_fuzzy = substr $soundex, 0, 2;
        my $mid_fuzzy = substr $soundex, 1, 2;
        print "\nTesting $name ($soundex) ($bit_fuzzy) ($mid_fuzzy)\n\n";
        for my $test( keys %hash ) {
            print "\t$test\n" if $hash{$test} eq $soundex;
        }
        print $/;
        for my $test( keys %hash ) {
            print "\t$test (bit fuzzy)\n" if $hash{$test} =~ m/$bit_fuzzy../;
        }
        print $/;
        for my $test( keys %hash ) {
            print "\t$test (mid fuzzy)\n" if $hash{$test} =~ m/.$mid_fuzzy./;
        }
        print $/;
    }

    __DATA__
    Grifferty       => G616
    Griffey         => G610
    Griffin         => G615
    Griffis         => G612
    Griffith        => G613
    MacGeady        => M230
    MacGee          => M200
    Mackintosh      => M253
    Magee           => M200
    McGee           => M200
    McGrifferty     => M261
    McIntosh        => M253
    O'Griffey       => O261
    O'Griffin       => O261

    Testing Griffin (G615) (G6) (61)

        Griffin

        Griffin (bit fuzzy)
        Griffis (bit fuzzy)
        Grifferty (bit fuzzy)
        Griffith (bit fuzzy)
        Griffey (bit fuzzy)

        Griffin (mid fuzzy)
        Griffis (mid fuzzy)
        Grifferty (mid fuzzy)
        Griffith (mid fuzzy)
        Griffey (mid fuzzy)

    Testing McGee (M200) (M2) (20)

        McGee
        MacGee
        Magee

        McGrifferty (bit fuzzy)
        McGee (bit fuzzy)
        MacGee (bit fuzzy)
        Magee (bit fuzzy)
        Mackintosh (bit fuzzy)
        McIntosh (bit fuzzy)
        MacGeady (bit fuzzy)

        McGee (mid fuzzy)
        MacGee (mid fuzzy)
        Magee (mid fuzzy)

    Testing McGinley (M254) (M2) (25)


        McGrifferty (bit fuzzy)
        McGee (bit fuzzy)
        MacGee (bit fuzzy)
        Magee (bit fuzzy)
        Mackintosh (bit fuzzy)
        McIntosh (bit fuzzy)
        MacGeady (bit fuzzy)

        Mackintosh (mid fuzzy)
        McIntosh (mid fuzzy)

    Testing Smith (S530) (S5) (53)
As you will note from the results, the answers are close to what you want. They do highlight some logical issues in your thinking, though. In one case you effectively demand ignoring the first letter, in the next you assign it importance. No matter what you use for approximation, things will get weird if the first letter is WRONG, as a few of the mid_fuzzy results show.

Also note that if you go for a 2 digit fuzzy match as shown then you will (on average) pull 1/(10*10), i.e. 1%, of your DB every time for mid_fuzzy, and 1/(26*10), i.e. about 0.4%, of your DB with the 2 char bit_fuzzy. If you only have 1000 records this is probably not a problem. It is a problem if you have 100,000 - 1,000,000 odd records, as a list of 1000 - 10,000 possibilities is a bit overwhelming. If you want a fuzzy search you would typically have a fuzzometer to let the client make the search progressively more fuzzy the more desperate they get to find whatever!
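For what it's worth, here is a rough sketch of what such a fuzzometer could look like. It is only an illustration, not a tested design: the fuzzy_matches sub and its "level" argument are names I made up, and the soundex codes are simply copied from the output above so that the snippet runs without Text::Soundex installed.

```perl
use strict;
use warnings;

# Soundex codes copied from the output above (no Text::Soundex needed here).
my %code_for = (
    Grifferty   => 'G616', Griffey     => 'G610', Griffin   => 'G615',
    Griffis     => 'G612', Griffith    => 'G613', MacGeady  => 'M230',
    MacGee      => 'M200', Mackintosh  => 'M253', Magee     => 'M200',
    McGee       => 'M200', McGrifferty => 'M261', McIntosh  => 'M253',
    "O'Griffey" => 'O261', "O'Griffin" => 'O261',
);

# Hypothetical helper: level 0 demands the full 4 char code, and each
# extra level drops one trailing character from the part that must match.
# Level 3 (first letter only) is the fuzziest setting.
sub fuzzy_matches {
    my ($code, $level) = @_;
    $level = 0 if $level < 0;
    $level = 3 if $level > 3;
    my $prefix = substr $code, 0, 4 - $level;
    my @hits = sort grep { index($code_for{$_}, $prefix) == 0 } keys %code_for;
    return @hits;
}

# Turn the dial up one notch at a time for McGinley (M254).
for my $level (0 .. 2) {
    my @hits = fuzzy_matches('M254', $level);
    printf "level %d: %s\n", $level, @hits ? "@hits" : '(none)';
}
```

The idea is just that the client starts at level 0 and only widens the net when nothing useful comes back, so the big result sets only show up when they are asked for.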
Combining two concurrent fuzzy searches is a potent technique. Even if each search pulls 1% of the DB, the intersection (the records that match both) will be much smaller, so you will typically reduce the result set by one, two or even three orders of magnitude. Fuzzy last name and initial would reduce the result set by a factor of roughly 26 (more like 20 but you get the idea).
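A minimal sketch of the last name plus initial idea, under some loud assumptions: the first names below are invented purely for illustration, the search sub is hypothetical, and the soundex codes are the ones from the output above. The numbers at the end just replay the 1% and 1/26 estimates for a 1,000,000 record DB.

```perl
use strict;
use warnings;

# Hypothetical records: [ first name, last name, soundex of last name ].
# The first names are made up; the codes come from the output above.
my @records = (
    [ 'Alan',  'McGee',      'M200' ],
    [ 'Mary',  'MacGee',     'M200' ],
    [ 'Anne',  'Magee',      'M200' ],
    [ 'Angus', 'Mackintosh', 'M253' ],
    [ 'Brian', 'McIntosh',   'M253' ],
    [ 'Moira', 'MacGeady',   'M230' ],
    [ 'Aidan', 'Griffin',    'G615' ],
);

# Keep only the records that pass BOTH tests: the soundex code starts
# with the fuzzy prefix AND the first name starts with the given initial.
sub search {
    my ($code_prefix, $initial) = @_;
    return grep {
        index($_->[2], $code_prefix) == 0
            && uc(substr($_->[0], 0, 1)) eq uc($initial)
    } @records;
}

my @hits = search('M2', 'A');
print "$_->[0] $_->[1]\n" for @hits;

# Rough expected candidate counts using the estimates from the text.
my $db_size = 1_000_000;
printf "fuzzy last name only: ~%d, plus initial: ~%d\n",
    $db_size * 0.01, $db_size * 0.01 / 26;
```

In a real DB you would presumably store the soundex code in its own indexed column and let the database do both tests, but the filtering logic is the same.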
cheers
tachyon
s&&rsenoyhcatreve&&&s&n.+t&"$'$`$\"$\&"&ee&&y&srve&&d&&print