hsmyers has asked for the wisdom of the Perl Monks concerning the following question:

For reasons that are beyond me, Amazon's web API sees nothing wrong with returning 'Vic Broquard', 'Broquard Vic' and 'Victor E. Broquard' all as authors for the same book! Needless to say this is at best a pain, but for the moment a fact of life. So I was thinking about how to solve this small quandary and I've come up with the following as a kind of heuristic approach.

#!/perl/bin/perl
#
# test.pl --
use strict;
use warnings;
use diagnostics;
use String::Similarity;

my @authors1 = (
    'Vic Broquard',
    'Broquard Vic',
    'Victor E. Broquard',
);
my @authors2 = (
    'Peter Prinz',
    'Ulla Kirch-Prinz',
);
my @authors3 = (
    'Larry Wall',
    'Tom Christiansen',
    'Jon Orwant',
);

print "Average Similarity for \@authors1 = ",check_similarity(@authors1),"\n";
print "Average Similarity for \@authors2 = ",check_similarity(@authors2),"\n";
print "Average Similarity for \@authors3 = ",check_similarity(@authors3),"\n";

sub check_similarity {
    my ($count,$similarity_total);
    for my $ref (combinations(@_)) {
        if (scalar(@$ref) == 2) {
            $count++;
            $similarity_total += similarity (@{$ref}[0],@{$ref}[1]);
        }
    }
    return $similarity_total / $count;
}

sub combinations {
    return [] unless @_;
    my $first = shift;
    my @rest = combinations(@_);
    return @rest, map { [$first, @$_] } @rest;
}

C:>test
Average Similarity for @authors1 = 0.666666666666667
Average Similarity for @authors2 = 0.444444444444444
Average Similarity for @authors3 = 0.246153846153846

The idea is to do something different in those instances where the check value is greater than, say, 0.5: in that case, assume that the names in the array are variations on each other and pick the longest one as the best guess. Or select an alternate lookup, or some other such approach. Anyone with suggestions, improvements, etc., chime in here please... thanks!
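
A minimal sketch of that decision step, under stated assumptions: the 0.5 threshold and the "pick the longest name" rule come from the post, but the helper name best_guess and the idea of passing the similarity function in as a code ref are illustrative choices, not part of the original script. Passing the function in keeps the sketch runnable without any CPAN module; with String::Similarity loaded, \&similarity could be handed over directly.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the original post).
# $sim       - code ref taking two strings, returning a score in 0..1
# $threshold - average score above which names count as variants
# Returns the single longest name if the names look like variants
# of each other, otherwise the list unchanged.
sub best_guess {
    my ($sim, $threshold, @names) = @_;

    # Fewer than two names: nothing to compare (and no division by zero).
    return @names if @names < 2;

    # Average the similarity over every unordered pair of names.
    my ($count, $total) = (0, 0);
    for my $i (0 .. $#names - 1) {
        for my $j ($i + 1 .. $#names) {
            $total += $sim->($names[$i], $names[$j]);
            $count++;
        }
    }
    return @names if $total / $count <= $threshold;

    # Treat the names as variants; keep the first longest one.
    my $longest = $names[0];
    for my $name (@names) {
        $longest = $name if length($name) > length($longest);
    }
    return ($longest);
}
```

Going by the averages printed above, a call such as best_guess(\&similarity, 0.5, @authors1) would be expected to return only 'Victor E. Broquard' (average 0.667), while @authors2 (0.444) and @authors3 (0.246) would come back unchanged. Looping over index pairs also sidesteps building every subset just to keep the pairs.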

--hsm

"Never try to teach a pig to sing...it wastes your time and it annoys the pig."

Re: Perls before Amazon
by benn (Vicar) on Jun 08, 2003 at 19:15 UTC
    As ever, there's a lovely little module on CPAN that does this: Lingua::EN::MatchNames. It uses Lingua::EN::NameParse, Lingua::EN::Nickname etc. to split up the name, check for variants and so on, and returns a percentage match (fairly arbitrary, but it seems pretty accurate). Very handy for cleaning up contacts databases...

    Cheers, Ben.

      Thanks! I've just installed
      • Lingua::EN::Nickname
      • Lingua::EN::MatchNames
      • Lingua::EN::NameCase
      • Lingua::EN::NameParse
      • String::Similarity
      but for some reason picked String::Similarity as the place to start. Oh well, 'twas ever thus!

      --hsm

      "Never try to teach a pig to sing...it wastes your time and it annoys the pig."
Re: Perls before Amazon
by LAI (Hermit) on Jun 09, 2003 at 19:09 UTC
    ... assume that the names in the array are variations on each other and pick the longest one as the best guess. Or select an alternate lookup or some other such approach.
    <chime>

    The first thing I thought of here was Google's approach, where they tell you:

    In order to show you the most relevant results, we have omitted some entries very similar to the 35 already displayed. If you like, you can repeat the search with the omitted results included.

    That way no data is lost: it is all available to the user, but the interface stays pretty and relatively concise.

    </chime>
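
One way that "omit the near-duplicates but keep them available" idea might look for author names, as a rough sketch only: the helper name collapse_similar, the greedy grouping strategy and the shown/omitted hash keys are all assumptions made for illustration, and the similarity function is again passed in as a code ref (for instance \&similarity from String::Similarity).

```perl
use strict;
use warnings;

# Hypothetical helper: greedily group names that score above
# $threshold against a group's displayed name. Nothing is thrown
# away; variants are kept in the group's 'omitted' list.
sub collapse_similar {
    my ($sim, $threshold, @names) = @_;
    my @groups;    # each group: { shown => $name, omitted => [ ... ] }

    NAME: for my $name (@names) {
        for my $group (@groups) {
            if ($sim->($group->{shown}, $name) > $threshold) {
                push @{ $group->{omitted} }, $name;
                next NAME;
            }
        }
        # No existing group is close enough: start a new one.
        push @groups, { shown => $name, omitted => [] };
    }
    return @groups;
}
```

The interface would then print each group's shown name, with a note offering the omitted variants on request. Because the first name seen becomes the displayed one, sorting the input by length beforehand would make the longest variant the one on show.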

    LAI

    __END__