Searching for best match

Sosi has asked for the wisdom of the Perl Monks concerning the following question:

I'm trying to search some arrays for the best possible matches. For instance, given the @source and @search lists:

my @source = ("John Ronald Reuel Tolkien","John Ronald S Tolkien","Tre
+nt Reznor","Barack Hussein Obama II","Barack Hussein II"); #note that
+ the second item is wrong and should be discarded!
my @search = ("John Ronald Reuel T","Trent Reznor","Barack Hussein II"
+,"Barack Hussein Obama II","No match here");
[download]

I would like to associate the @search list with the best match in @source list. I'm figuring this can be done with a search pattern with several ORs, but I'm stuck. Please see my example below:

#!/usr/local/bin/perl
use strict;
use warnings;

my @source = ("John Ronald Reuel Tolkien","John Ronald S Tolkien","Tre
+nt Reznor","Barack Hussein Obama II","Barack Hussein II");
my @search = ("John Ronald Reuel T","Trent Reznor","Barack Hussein II"
+,"Barack Hussein Obama II","No match here");

print "twonames\t\talternativesearch\n";
foreach my $s (@search){
    #gets first two names
    (my $twonames=$s)=~s/^(\w+ \w+).*$/$1/;

    #gets all other names, if they exist
    (my $others=$s)=~s/^(\w+ \w+)//;
    if ($others){
        #deletes initial space
        (my $alternativesearch=$others)=~s/^\s//;
        $alternativesearch=~s/\s/\|/g;
        print "$twonames\t\t$alternativesearch\n";
    }
    else {
        print "$twonames\t\tNO OTHER NAMES PRESENT\n";
    }
}
    #prints
    # twonames                alternativesearch
    # John Ronald             Reuel|T
    # Trent Reznor            NO OTHER NAMES PRESENT
    # Barack Hussein          II
    # Barack Hussein          Obama|II
[download]

In this search I would like to have an association between @search items and @source items that would yield the best match. Something like:

# search                    source
# John Ronald Reuel T       John Ronald Reuel Tolkien
# Trent Reznor              Trent Reznor
# Barack Hussein Obama II   Barack Hussein Obama II
# No match here
[download]

Note that, in the case of Obama it matched the whole array, in the first line it matched the two first words plus something else, and in the last case it found nothing. How would you proceed to find the best match? (i.e. the match where the search sentence would be mostly contained in the source sentence?) Thanks

Edit: this was crossposted on StackOverflow

Edit2: Even though I used people's names in my example, my real case has no people's names in case that matters.

Comment on Searching for best match Select or Download Code

Replies are listed 'Best First'.

Re: Searching for best match
by Athanasius (Archbishop) on Oct 06, 2014 at 13:21 UTC

Hello Sosi,

In addition to Text::Fuzzy, you should look at Algorithm::Diff. Here’s some proof-of-concept code:

#! perl
use strict;
use warnings;
use Algorithm::Diff qw( LCS );
use Data::Dump;
use constant MIN_MATCH => 0.5;

my @source =
(
    'John Ronald Reuel Tolkien',
    'John Ronald S Tolkien',
    'Trent Reznor',
    'Barack Hussein Obama II',
    'Barack Hussein II',
);

my @search =
(
    'John Ronald Reuel T',
    'Trent Reznor',
    'Barack Hussein II',
    'Barack Hussein Obama II',
    'No match here',
);

my %searches = map { $_ => [] } @search;

for my $s (@search)
{
    my @search_chars = split //, $s;
    my @matches;

    for my $source (@source)
    {
        my @source_chars = split //, $source;
        my @diff_chars   = LCS(\@search_chars, \@source_chars);
        my $diff         = join '', @diff_chars;

        if (!@{ $searches{$s} } ||
            length $diff > length $searches{$s}->[0])
        {
            $searches{$s}->[0] = $diff;
            $searches{$s}->[1] = $source;
        }
    }
}

for my $key (keys %searches)
{
    my $len_key   = length $key;
    my $len_match = length $searches{$key}->[0];

    delete $searches{$key} if ($len_match / $len_key) < MIN_MATCH;
}

dd \%searches;
[download]

Output:

23:19 >perl 1044_SoPW.pl
{
  "Barack Hussein II"       => ["Barack Hussein II", "Barack Hussein O
+bama II"],
  "Barack Hussein Obama II" => ["Barack Hussein Obama II", "Barack Hus
+sein Obama II"],
  "John Ronald Reuel T"     => ["John Ronald Reuel T", "John Ronald Re
+uel Tolkien"],
  "Trent Reznor"            => ["Trent Reznor", "Trent Reznor"],
}

23:19 >
[download]

Hope that helps,

Athanasius <°(((>< contra mundum Iustus alius egestas vitae, eros Piratica,

[reply]
[d/l]
[select]

Re: Searching for best match
by Eily (Monsignor) on Oct 06, 2014 at 12:30 UTC

Wow, looks like perl is too clever for its own good. You feed it the string "Reuel T" and it magically translates it to "R Tolkien". Now that's DWIM taken too far!

Beside that inconsistency between your input and the output you say you have, your exemple does not illustrate well what you meant by "best match", because in all cases the search string is fully contained in one of the sources. ("John Ronald Reuel T" and "<John Ronald Reuel T>olkien"). If your original search string was indeed "John Ronald R Tolkien", I suppose you'd still want to match "John Ronald Reuel Tolkien"?

And FYI, you can do all this:

 #gets first two names
(my $twonames=$s)=~s/^(\w+ \w+).*$/$1/;

#gets all other names, if they exist
(my $others=$s)=~s/^(\w+ \w+)//;

#deletes initial space
(my $alternativesearch=$others)=~s/^\s//;
[download]

Quote and Quote like Operators

my ($twonames, $others) = /^(\w+ \w+)(?: (.+))?/;

[reply]
[d/l]
[select]

Re^2: Searching for best match

by Sosi (Sexton) on Oct 06, 2014 at 12:54 UTC

eh I missed that upon editing my post. Thanks, I edited it accordingly.

Regarding the "best" match, I am looking for the match that differs less. This would be the match that would contain the most characters. I am now looking into the smallest distance between sentences using Text::Fuzzy, but I'll have a look at those modules that Anonymous Monk posted above. Also, thanks for your tip on the one line assignment. I guess this just shows how "green" I am :( Thank you!

[reply]

Re: Searching for best match
by Anonymous Monk on Oct 06, 2014 at 12:20 UTC

How would you proceed to find the best match?

That depends somewhat on your definition of "best", and some more explanation and examples would help, but I'm going to guess it's the longest one?

for my $s (@search) {
  my @names = split ' ', $s;
  # this reqires at least first two names to match
  my @matches = grep { /^\Q$names[0]\E\s+\Q$names[1]\E\b/ } @source;
  @matches = sort {length($b)<=>length($a)} @matches;
  print "search='$s'\n";
  print "\tfound='$_'\n" for @matches;
}
__END__
search='John Ronald Reuel T'
    found='John Ronald Reuel Tolkien'
    found='John Ronald S Tolkien'
search='Trent Reznor'
    found='Trent Reznor'
search='Barack Hussein II'
    found='Barack Hussein Obama II'
    found='Barack Hussein II'
search='Barack Hussein Obama II'
    found='Barack Hussein Obama II'
    found='Barack Hussein II'
search='No match here'
[download]

Just a note, using \w+ to match a name may not be enough, since it might not include all the characters you would consider part of a name (for example, in ASCII it doesn't include the dot, as in "Jr." or "Sr."). That's why the code above takes the alternative approach of splitting on whitespace. However, even that might not be enough, and you should probably look into the Lingua:: namespace on CPAN. For example, a quick search brings up Lingua::EN::MatchNames and Lingua::EN::NameParse.

[reply]
[d/l]
[select]

Re^2: Searching for best match

by Sosi (Sexton) on Oct 06, 2014 at 12:57 UTC

One more thing: even though I used people's names in my example, my real case has no real people names (I'm working with organisms' species in case you're interested), so I can't use Lingua::. It's my fault I chose the wrong example, I'm sorry for that.

[reply]

Re^2: Searching for best match

by Sosi (Sexton) on Oct 06, 2014 at 12:47 UTC

Thank you! Yes the best match is the longest one. In the Stackoverflow post someone suggested that I looked into fuzzy-matching modules. I'm also looking into this.

[reply]

Re^3: Searching for best match

by Anonymous Monk on Oct 06, 2014 at 15:39 UTC

It's a little unclear to me if Text::Fuzzy does what you want, but of course investigating CPAN modules is a good idea.

Also, just a note that the code above is only an interpretation if what your original code appears to want to do, i.e. looking at only the first two names for matches.

A more complete selection of sample input, description of what you want the match to be, and sample output would really help, I think.

[reply]

Re: Searching for best match
by choroba (Cardinal) on Oct 06, 2014 at 12:13 UTC

StackOverflow

لսႽ� ᥲᥒ⚪⟊Ⴙᘓᖇ Ꮅᘓᖇ⎱ Ⴙᥲ𝇋ƙᘓᖇ

[reply]

Re^2: Searching for best match

by Sosi (Sexton) on Oct 06, 2014 at 12:40 UTC

I'm very sorry, you are right! I'll edit my post and do this in the future

[reply]

Re: Searching for best match
by Laurent_R (Canon) on Oct 06, 2014 at 22:20 UTC

Mastering Algorithms with Perl

It might be out of print, but I did not have too much trouble finding it second hand on the Internet a couple of years ago, or so. And it is a really "nice to read" and "nice to have" book. It definitely helped me several times when I had to tackle a subject really new to me (or long forgotten).

You might prefer The art of Computer Programming, the bible of programming by Donald Knuth, which I am also using relatively regularly (probably being a bit of a masochistic), or the famous Cormen et al. textbook, but the good thing about Mastering Algorithms with Perl is that it is usually easier to grasp and is directly related to our (well at least my) favorite language.

[reply]