For reasons that are beyond me, Amazon's web API sees nothing wrong with returning 'Vic Broquard', 'Broquard Vic' and 'Victor E. Broquard' all as authors for the same book! Needless to say, this is at best a pain, but for the moment it is a fact of life. So I was thinking about how to solve this small quandary, and I've come up with the following as a kind of heuristic approach.

#!/perl/bin/perl
#
# test.pl --
use strict;
use warnings;
use diagnostics;
use String::Similarity;

my @authors1 = (
    'Vic Broquard',
    'Broquard Vic',
    'Victor E. Broquard',
);
my @authors2 = (
    'Peter Prinz',
    'Ulla Kirch-Prinz',
);
my @authors3 = (
    'Larry Wall',
    'Tom Christiansen',
    'Jon Orwant',
);

print "Average Similarity for \@authors1 = ", check_similarity(@authors1), "\n";
print "Average Similarity for \@authors2 = ", check_similarity(@authors2), "\n";
print "Average Similarity for \@authors3 = ", check_similarity(@authors3), "\n";

# Average the similarity() score over every pair of names.
sub check_similarity {
    my ($count, $similarity_total);
    for my $ref (combinations(@_)) {
        # combinations() returns subsets of every size; keep only the pairs.
        if (scalar(@$ref) == 2) {
            $count++;
            $similarity_total += similarity($ref->[0], $ref->[1]);
        }
    }
    return $similarity_total / $count;
}

# Return every subset of the arguments, each as an array reference.
sub combinations {
    return [] unless @_;
    my $first = shift;
    my @rest = combinations(@_);
    return @rest, map { [$first, @$_] } @rest;
}

C:>test
Average Similarity for @authors1 = 0.666666666666667
Average Similarity for @authors2 = 0.444444444444444
Average Similarity for @authors3 = 0.246153846153846

The idea is to treat the cases where the check value is greater than, say, .5 differently: assume that the names in the array are variations on each other and pick the longest one as the best guess. Alternatively, fall back to an alternate lookup or some other such approach. Anyone with suggestions, improvements, etc., chime in here please... thanks!
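
A minimal sketch of that decision step, for illustration only. The sub name best_guess and the $THRESHOLD variable are hypothetical (they are not part of test.pl), and the .5 cutoff is just the "say" value mentioned above, not a tuned number. The sub takes an average that has already been computed (for example by check_similarity) plus the list of names, so it only needs the core List::Util module:

```perl
use strict;
use warnings;
use List::Util qw(reduce);

# Assumed cutoff: the ".5" floated in the post, not a tuned value.
my $THRESHOLD = 0.5;

# Hypothetical helper: given an average similarity and the names it was
# computed from, return the longest name when the average exceeds the
# threshold; otherwise return nothing (undef in scalar context).
sub best_guess {
    my ($average, @names) = @_;
    return unless @names && $average > $THRESHOLD;
    # Keep the longer of each pair; ties keep the earlier name.
    return reduce { length($b) > length($a) ? $b : $a } @names;
}

my $guess = best_guess(
    0.666666666666667,
    'Vic Broquard', 'Broquard Vic', 'Victor E. Broquard',
);
print defined $guess
    ? "Best guess: $guess\n"
    : "Treat as distinct authors\n";
```

With the averages shown in the output above, only @authors1 would clear the cutoff; @authors2 and @authors3 would come back undefined, which is where an alternate lookup could take over.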

--hsm

"Never try to teach a pig to sing...it wastes your time and it annoys the pig."

In reply to Perls before Amazon by hsmyers
