Perls before Amazon

hsmyers has asked for the wisdom of the Perl Monks concerning the following question:

For reasons that are beyond me, Amazon's web API sees nothing wrong with returning 'Vic Broquard', 'Broquard Vic' and 'Victor E. Broquard' all as authors for the same book! Needless to say, this is at best a pain, but for the moment it is a fact of life. So I was thinking about how to solve this small quandary, and I've come up with the following as a kind of heuristic approach.

#!/perl/bin/perl
#
# test.pl --
use strict;
use warnings;
use diagnostics;
use String::Similarity;

my @authors1 = (
'Vic Broquard',
'Broquard Vic',
'Victor E. Broquard',
);

my @authors2 = (
'Peter Prinz',
'Ulla Kirch-Prinz',
);

my @authors3 = (
'Larry Wall',
'Tom Christiansen',
'Jon Orwant',
);

print "Average Similarity for \@authors1 = ",check_similarity(@authors1),"\n";
print "Average Similarity for \@authors2 = ",check_similarity(@authors2),"\n";
print "Average Similarity for \@authors3 = ",check_similarity(@authors3),"\n";

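sub check_similarity {
    my ($count, $similarity_total) = (0, 0);
    # combinations() returns every subset; only the two-name subsets are scored.
    for my $ref (combinations(@_)) {
        if (@$ref == 2) {
            $count++;
            $similarity_total += similarity($ref->[0], $ref->[1]);
        }
    }
    # With fewer than two names there is no pair to compare, so report 0
    # rather than dividing by zero.
    return $count ? $similarity_total / $count : 0;
}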

# Returns every subset of @_ (the power set), each as an array ref.
sub combinations {
  return [] unless @_;
  my $first = shift;
  my @rest = combinations(@_);
  return @rest, map { [$first, @$_] } @rest;
}

C:>test
Average Similarity for @authors1 = 0.666666666666667
Average Similarity for @authors2 = 0.444444444444444
Average Similarity for @authors3 = 0.246153846153846

The idea is to treat specially those instances where the check value is, say, greater than 0.5: in this case, assume that the names in the array are variations on each other and pick the longest one as the best guess. Or select an alternate lookup or some other such approach. Anyone with suggestions, improvements, etc., chime in here please... thanks!
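
The decision step described above could be sketched roughly as follows. This is only an illustration, not the code from the post: `average_pair_score` and `best_guess` are hypothetical names, the scoring function is passed in as a code ref so the sketch runs with core Perl alone, and the pairs are visited directly with two loops instead of filtering the output of `combinations`.

```perl
use strict;
use warnings;

# Hypothetical helper (not in the original post): average the scores of
# every unordered pair of names. $score is a code ref taking two strings
# and returning a number between 0 and 1.
sub average_pair_score {
    my ($score, @names) = @_;
    my ($count, $total) = (0, 0);
    for my $i (0 .. $#names - 1) {
        for my $j ($i + 1 .. $#names) {
            $total += $score->($names[$i], $names[$j]);
            $count++;
        }
    }
    # No pairs (zero or one name) means nothing to compare.
    return $count ? $total / $count : 0;
}

# Hypothetical helper: if the average is above the threshold, assume the
# names are variations of one another and return only the longest one;
# otherwise return the list unchanged.
sub best_guess {
    my ($score, $threshold, @names) = @_;
    return @names if average_pair_score($score, @names) <= $threshold;
    my ($longest) = sort { length($b) <=> length($a) } @names;
    return ($longest);
}
```

Assuming String::Similarity is loaded, a call might look like `best_guess(sub { similarity($_[0], $_[1]) }, 0.5, @authors1)`. The 0.5 cut-off is the guess from the post, not a tuned value.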

--hsm

"Never try to teach a pig to sing...it wastes your time and it annoys the pig."

Replies are listed 'Best First'.
Re: Perls before Amazon by benn (Vicar) on Jun 08, 2003 at 19:15 UTC
As ever, there's a lovely little module on CPAN that does this: Lingua::EN::MatchNames. It uses Lingua::EN::NameParse, Lingua::EN::Nickname etc. to split up the name, check for variants and so on, and returns a (fairly arbitrary, but seemingly pretty accurate) percentage match. Very handy for cleaning up contacts databases... Cheers, Ben.
Re: Re: Perls before Amazon by hsmyers (Canon) on Jun 08, 2003 at 22:33 UTC
Thanks! I've just installed Lingua::EN::Nickname, Lingua::EN::MatchNames, Lingua::EN::NameCase, Lingua::EN::NameParse and String::Similarity, but for some reason picked String::Similarity as the place to start. Oh well, 'twas ever thus! --hsm
Re: Perls before Amazon by LAI (Hermit) on Jun 09, 2003 at 19:09 UTC
"... assume that the names in the array are variations on each other and pick the longest one as the best guess. Or select an alternate lookup or some other such approach."

`<chime>` The first thing I thought of here was Google's approach, where they tell you:

"In order to show you the most relevant results, we have omitted some entries very similar to the 35 already displayed. If you like, you can repeat the search with the omitted results included."

That way no data is lost: it is all available to the user, but the interface stays pretty and relatively concise. `</chime>`

LAI `__END__`
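
One way to read the "omit but keep available" suggestion is to show a single name and keep the other variants alongside it. The sketch below assumes the names have already been judged to be variations of one author; `collapse_variants` and the `shown`/`omitted` keys are hypothetical names, and picking the longest name as the one to show just follows the original post's guess.

```perl
use strict;
use warnings;

# Hypothetical helper: pick one name to display and retain the rest,
# so nothing returned by the lookup is thrown away.
sub collapse_variants {
    my @names = @_;
    # Longest name first; ties are broken alphabetically so the result is stable.
    my ($shown, @omitted) =
        sort { length($b) <=> length($a) or $a cmp $b } @names;
    return { shown => $shown, omitted => \@omitted };
}

my $entry = collapse_variants('Vic Broquard', 'Broquard Vic', 'Victor E. Broquard');
print "Author: $entry->{shown}\n";
if (@{ $entry->{omitted} }) {
    printf "(%d similar names omitted: %s)\n",
        scalar @{ $entry->{omitted} }, join('; ', @{ $entry->{omitted} });
}
```

The interface can then print only the shown name by default and offer the omitted ones on request.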