Cause it's fun (aka I like fuzzy logic).....

use Text::Soundex;

my @names = qw( McGee MacGee Magee MacGeady Mackintosh McIntosh
                Griffin Griffith Griffis Griffey Grifferty
                McGrifferty O'Griffey O'Griffin );
my %hash;
$hash{$_} = soundex($_) for @names;
printf "%-15s => %s\n", $_, $hash{$_} for sort keys %hash;

my @tests = qw( Griffin McGee McGinley Smith );

for my $name( @tests ) {
    my $soundex = soundex($name);
    # you can make the search fuzzy in different ways.....
    my $bit_fuzzy = substr $soundex, 0, 2;
    my $mid_fuzzy = substr $soundex, 1, 2;
    print "\nTesting $name ($soundex) ($bit_fuzzy) ($mid_fuzzy)\n\n";
    for my $test( keys %hash ) {
        print "\t$test\n" if $hash{$test} eq $soundex;
    }
    print $/;
    for my $test( keys %hash ) {
        print "\t$test (bit fuzzy)\n" if $hash{$test} =~ m/$bit_fuzzy../;
    }
    print $/;
    for my $test( keys %hash ) {
        print "\t$test (mid fuzzy)\n" if $hash{$test} =~ m/.$mid_fuzzy./;
    }
    print $/;
}

__DATA__
Grifferty       => G616
Griffey         => G610
Griffin         => G615
Griffis         => G612
Griffith        => G613
MacGeady        => M230
MacGee          => M200
Mackintosh      => M253
Magee           => M200
McGee           => M200
McGrifferty     => M261
McIntosh        => M253
O'Griffey       => O261
O'Griffin       => O261

Testing Griffin (G615) (G6) (61)

    Griffin

    Griffin (bit fuzzy)
    Griffis (bit fuzzy)
    Grifferty (bit fuzzy)
    Griffith (bit fuzzy)
    Griffey (bit fuzzy)

    Griffin (mid fuzzy)
    Griffis (mid fuzzy)
    Grifferty (mid fuzzy)
    Griffith (mid fuzzy)
    Griffey (mid fuzzy)

Testing McGee (M200) (M2) (20)

    McGee
    MacGee
    Magee

    McGrifferty (bit fuzzy)
    McGee (bit fuzzy)
    MacGee (bit fuzzy)
    Magee (bit fuzzy)
    Mackintosh (bit fuzzy)
    McIntosh (bit fuzzy)
    MacGeady (bit fuzzy)

    McGee (mid fuzzy)
    MacGee (mid fuzzy)
    Magee (mid fuzzy)

Testing McGinley (M254) (M2) (25)

    McGrifferty (bit fuzzy)
    McGee (bit fuzzy)
    MacGee (bit fuzzy)
    Magee (bit fuzzy)
    Mackintosh (bit fuzzy)
    McIntosh (bit fuzzy)
    MacGeady (bit fuzzy)

    Mackintosh (mid fuzzy)
    McIntosh (mid fuzzy)

Testing Smith (S530) (S5) (53)

As you will note from the results, the answers are close to what you want. They do highlight some logical issues in your thinking: in one case you effectively demand ignoring the first letter, in the next you assign it importance. No matter what you use for approximation, things will get weird if the first letter is WRONG, as a few of the mid_fuzzy results show.

Also note that if you go for a 2-digit fuzzy match as shown, then you will (on average) pull 1/(10*10), i.e. 1%, of your DB every time for mid_fuzzy, and 1/(26*10), i.e. about 0.4%, of your DB with the 2-char bit_fuzzy. If you only have 1000 records this is probably not a problem. It is a problem if you have 100,000 - 1,000,000 odd records, as a list of 1000 - 10,000 possibilities is a bit overwhelming. If you want a fuzzy search you would typically have a fuzzometer to let the client make the search progressively more fuzzy the more desperate they get to find whatever!
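A fuzzometer could look something like the sketch below. It is only one way to do it: the level scheme (ignore the last 0 to 3 characters of the code) and the helper name fuzzy_matches are my assumptions for illustration. The soundex codes are copied from the output above, so the snippet runs without Text::Soundex.

```perl
use strict;
use warnings;

# Soundex codes copied from the output above.
my %code = (
    Grifferty   => 'G616', Griffey    => 'G610', Griffin => 'G615',
    Griffis     => 'G612', Griffith   => 'G613', MacGeady => 'M230',
    MacGee      => 'M200', Mackintosh => 'M253', Magee   => 'M200',
    McGee       => 'M200', McGrifferty => 'M261', McIntosh => 'M253',
    "O'Griffey" => 'O261', "O'Griffin" => 'O261',
);

# Hypothetical helper: $level is the number of trailing characters of the
# 4-character code to ignore (0 = exact match, 2 = the "bit fuzzy" search,
# 3 = first letter only). Returns the matching names, sorted.
sub fuzzy_matches {
    my ($soundex, $level) = @_;
    my $prefix = substr $soundex, 0, 4 - $level;
    my @hits = grep { index($code{$_}, $prefix) == 0 } keys %code;
    return sort @hits;
}

# Turn the dial up one notch at a time for McGinley (M254).
for my $level (0 .. 3) {
    my @hits = fuzzy_matches('M254', $level);
    printf "level %d: %s\n", $level, @hits ? "@hits" : '(none)';
}
```

With this data, level 0 finds nothing for McGinley, level 1 finds Mackintosh and McIntosh, and levels 2 and 3 return all seven M names, which is the "more desperate, more results" behaviour you want from the dial.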

Combining two concurrent fuzzy searches is a potent technique. Even if each search pulls 1% of the DB, the intersection of the two result sets will be much smaller, so you will typically reduce the result set by one to three orders of magnitude. Fuzzy last name plus initial, for example, would reduce the result set by a factor of roughly 26 (more like 20, but you get the idea).
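As a rough sketch of combining two searches, the snippet below runs the bit fuzzy and mid fuzzy matches from the code above and keeps only the names that both of them return. The helper names (bit_fuzzy, mid_fuzzy, intersect) are made up for this example, and the codes are again copied from the output above rather than computed.

```perl
use strict;
use warnings;

# Soundex codes copied from the output above.
my %code = (
    Grifferty   => 'G616', Griffey    => 'G610', Griffin => 'G615',
    Griffis     => 'G612', Griffith   => 'G613', MacGeady => 'M230',
    MacGee      => 'M200', Mackintosh => 'M253', Magee   => 'M200',
    McGee       => 'M200', McGrifferty => 'M261', McIntosh => 'M253',
    "O'Griffey" => 'O261', "O'Griffin" => 'O261',
);

# Names whose code shares the first letter and first digit.
sub bit_fuzzy {
    my ($soundex) = @_;
    my $part = substr $soundex, 0, 2;
    my @hits = grep { $code{$_} =~ /^\Q$part\E../ } keys %code;
    return sort @hits;
}

# Names whose code shares the first two digits, whatever the first letter.
sub mid_fuzzy {
    my ($soundex) = @_;
    my $part = substr $soundex, 1, 2;
    my @hits = grep { $code{$_} =~ /^.\Q$part\E./ } keys %code;
    return sort @hits;
}

# Keep only the names present in both result lists (array references).
sub intersect {
    my ($first, $second) = @_;
    my %in_first = map { $_ => 1 } @$first;
    my @both = grep { $in_first{$_} } @$second;
    return sort @both;
}

my @bit  = bit_fuzzy('M261');    # M261 is McGrifferty's code
my @mid  = mid_fuzzy('M261');
my @both = intersect(\@bit, \@mid);

printf "bit fuzzy: %d, mid fuzzy: %d, both: %d (%s)\n",
    scalar @bit, scalar @mid, scalar @both, "@both";
```

Here the two searches return seven and three names on their own, but only McGrifferty survives both. On a real DB you would more likely push both conditions into the query itself than intersect lists in Perl.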

cheers

tachyon

s&&rsenoyhcatreve&&&s&n.+t&"$'$`$\"$\&"&ee&&y&srve&&d&&print


In reply to Re: Closest matches from string array by tachyon
in thread Closest matches from string array by Baz
