samtregar has asked for the wisdom of the Perl Monks concerning the following question:

Hello all. I'm using String::Approx to find the best match for a given string against a list of candidates. Following the module's POD, here's my code:

# get best match by getting relative distances and # taking the lowest value my %dist; @dist{@input} = map { abs } adistr($target, @input); my ($best) = sort { $dist{$a} <=> $dist{$b} } @input; # re-check match to make sure it's good enough if ($best and amatch($target, $best)) { # got a match... } else { # no such luck. }

My problem is that String::Approx isn't picking the best match in many cases. For example, given:

$target = 'source'; @inputs = qw(donor_dedupe_source indicted_video_source source1);

the code picks "donor_dedupe_source" instead of "source1". That's because String::Approx considers them to be equally perfect matches, giving them both a adistr score of 0.

What's the best way to solve this? I could do another pass over scores with equal values sorting by closeness to length($target), but that seems gross.

-sam

Replies are listed 'Best First'.
Re: Tuning an approximate match to prefer closer lengths
by blokhead (Monsignor) on Nov 09, 2005 at 01:05 UTC
    The Levenshtein edit-distance algorithm (not sure if that's what String::Approx uses) can be parametrized, so you can give different penalties for character insertions, deletions, and discrepancies (instead of all penalties equal to 1 as in its usual usage). In fact, it looks like someone has already done this with Text::WagnerFischer.

    Just give insertions & deletions a (much) higher penalty than character discrepancies. The modified Levenshtein algorithm will give the optimal distance based on those penalties, and you can find the closest among a list of candidates.

    blokhead

Re: Tuning an approximate match to prefer closer lengths
by BrowserUk (Patriarch) on Nov 08, 2005 at 23:49 UTC

    If you pad the length of the shorter term to match that of the longer, you get more useful numbers

    perl> print "$_ : ", adistr( pack( 'A'.length, 'source' ), $_ ) for @i +nputs;; donor_dedupe_source : 13 indicted_video_source : 15 source1 : 1

    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    Lingua non convalesco, consenesco et abolesco. -- Rule 1 has a caveat! -- Who broke the cabal?
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
      That looks fishy to me. String::Approx::adistr returns the relative distance of a match, returning a value between 0 and 1. First, your values are outside that range. Second, nothing in String::Approx's docs leads me to believe that values from two separate runs of adistr() can be reliably compared.

      When I tried it on my data-set it caused more errors than the previous method, although it did nail "source"!

      -sam

        nothing in String::Approx's docs leads me to believe that values from two separate runs of adistr() can be reliably compared.

        According to the docs, the

        The measure of approximateness is the Levenshtein edit distance. It is the total number of "edits": insertions, deletions and substitutions required to transform a string to another string. For example, to transform "lead" into "gold", you need three edits: The edit distance of "lead" and "gold" is therefore three, or 75%.

        Which means that by padding the shorter value, you are getting the Levenshtein distances, converted to a ratio. Provided that one of each pair is the same string, the values should be comparable.

        This is the output from the tests I ran, with an additional check of the version:

        P:\test>p1 perl> use String::Approx qw[ adist ];; perl> @inputs = qw(donor_dedupe_source indicted_video_source source1); +; perl> print "$_ : ", adist( pack( 'A'.length, 'source' ), $_ ) for @in +puts;; donor_dedupe_source : 13 indicted_video_source : 15 source1 : 1 perl> Terminating on signal SIGINT(2) P:\test>p1 perl> use String::Approx qw[ adistr ];; perl> @inputs = qw(donor_dedupe_source indicted_video_source source1); +; perl> print "$_ : ", adistr( pack( 'A'.length, 'source' ), $_ ) for @i +nputs;; donor_dedupe_source : 13 indicted_video_source : 15 source1 : 1 perl> print String::Approx::VERSION;; print() on unopened filehandle VERSION at (eval 6) line 1, <STDIN> lin +e 4. perl> print $String::Approx::VERSION;; 3.19

        On the basis that I have seen an occasional quirk of running stuff via my perl shell, I've just run the following script:

        #! perl -slw use strict; use String::Approx qw[ adist adistr ]; print "String::Approx::VERSION: ", $String::Approx::VERSION; my @inputs = qw(donor_dedupe_source indicted_video_source source1); print "\nUsing adist"; print "$_ : ", adist( pack( 'A'.length, 'source' ), $_ ) for @inputs; print "\nUsing adistr"; print "$_ : ", adistr( pack( 'A'.length, 'source' ), $_ ) for @inputs; __END__ P:\test>junk2 String::Approx::VERSION: 3.19 Using adist donor_dedupe_source : 13 indicted_video_source : 15 source1 : 1 Using adistr donor_dedupe_source : 13 indicted_video_source : 15 source1 : 1

        Which shows no difference.

        Now I will re-install String::Approx from PPM and try again, and then I get the cpan copy and build that.

        Update: It would appear that adistr() changed between versions 3.19 that I've had forever and and 3.25:

        P:\test>ppm PPM - Programmer's Package Manager version 3.2. Copyright (c) 2001 ActiveState Corp. All Rights Reserved. ActiveState is a division of Sophos. Entering interactive shell. Using Term::ReadLine::Perl as readline lib +rary. Type 'help' to get started. ppm> search String::APprox Searching in Active Repositories 1. String-Approx [3.25] Jarkko Hietaniemi (jhi@iki.fi) ActiveState P +PM2 Repository [http://ppm 2. String-Approx [3.25] ActiveState P +ackage Repository [http:// ppm> install 1 Package 1: Note: Package 'String-Approx' is already installed. ==================== Install 'String-Approx' version 3.25 in ActivePerl 5.8.3.809. ==================== Downloaded 22757 bytes. Extracting 9/9: blib/arch/auto/String/Approx/Approx.lib Installing C:\Perl\site\lib\auto\String\Approx\Approx.dll Installing C:\Perl\site\lib\auto\String\Approx\Approx.exp Installing C:\Perl\site\lib\auto\String\Approx\Approx.lib Installing C:\Perl\html\site\lib\String\Approx.html Files found in blib\arch: installing files in blib\lib into architectu +re dependent library tree Installing C:\Perl\site\lib\String\Approx.pm Successfully installed String-Approx version 3.25 in ActivePerl 5.8.3. +809. ppm> quit P:\test>junk2 String::Approx::VERSION: 3.25 Using adist donor_dedupe_source : 13 indicted_video_source : 15 source1 : 1 Using adistr donor_dedupe_source : 0.684210526315789 indicted_video_source : 0.714285714285714 source1 : 0.142857142857143

        One anomaly that shows up is that PPM reports as installing in "ActivePerl 5.8.3.809.", which I haven't used (or even had installed) for many months? I think this must be from a registry setting that hasn't been properly cleaned/overwritten, but I need to investigate that.

        The version of perl used is 5.8.7.813:

        P:\test>perl -v This is perl, v5.8.7 built for MSWin32-x86-multi-thread (with 7 registered patches, see perl -V for more detail) Copyright 1987-2005, Larry Wall Binary build 813 [148120] provided by ActiveState http://www.ActiveSta +te.com ActiveState is a division of Sophos. Built Jun 6 2005 13:36:37 Perl may be copied only under the terms of either the Artistic License + or the GNU General Public License, which may be found in the Perl 5 source ki +t. Complete documentation for Perl, including FAQ lists, should be found +on this system using `man perl' or `perldoc perl'. If you have access to + the Internet, point your browser at http://www.perl.org/, the Perl Home Pa +ge.

        Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
        Lingua non convalesco, consenesco et abolesco. -- Rule 1 has a caveat! -- Who broke the cabal?
        "Science is about questioning the status quo. Questioning authority".
        In the absence of evidence, opinion is indistinguishable from prejudice.
Re: Tuning an approximate match to prefer closer lengths
by revdiablo (Prior) on Nov 09, 2005 at 00:01 UTC

    In this case, you could add an additional criteria to compare the length of the strings. The one with the length closest to your target string would be 'source1'. I'm not sure if the idea will apply generally, but it's worth investigation.

    Update: just noticed you mentioned the same exact thing in your question. Silly me for not reading carefully enough. Sorry about that. :-)

    I don't think it would be too particularly "gross", though, to have two passes. After all, this is a heuristic, right? Heuristics are usually messy.

      Yes, perhaps you're right. I implemented this approach and it seemed to do the trick.

      -sam

Re: Tuning an approximate match to prefer closer lengths
by bageler (Hermit) on Nov 09, 2005 at 16:16 UTC
    using Text::Levenshtein's distance function, I get source1 as the best match.
    use Text::Levenshtein qw(distance); $target = 'source'; $best=undef; for (qw(donor_dedupe_source indicted_video_source source1)) { $dist = distance($target,$_); if ($best == undef || $dist < $best) { $best = $dist; $bestword = $_; } } print "Best word was $bestword with an edit distance of $best\n";
    output: Best word was source1 with an edit distance of 1

    update: changed default value of $best to prevent perfect matches from being lost