Sosi has asked for the wisdom of the Perl Monks concerning the following question:

I'm trying to search some arrays for the best possible matches. For instance, given the @source and @search lists:

my @source = ("John Ronald Reuel Tolkien","John Ronald S Tolkien","Tre +nt Reznor","Barack Hussein Obama II","Barack Hussein II"); #note that + the second item is wrong and should be discarded! my @search = ("John Ronald Reuel T","Trent Reznor","Barack Hussein II" +,"Barack Hussein Obama II","No match here");

I would like to associate the @search list with the best match in @source list. I'm figuring this can be done with a search pattern with several ORs, but I'm stuck. Please see my example below:

#!/usr/local/bin/perl use strict; use warnings; my @source = ("John Ronald Reuel Tolkien","John Ronald S Tolkien","Tre +nt Reznor","Barack Hussein Obama II","Barack Hussein II"); my @search = ("John Ronald Reuel T","Trent Reznor","Barack Hussein II" +,"Barack Hussein Obama II","No match here"); print "twonames\t\talternativesearch\n"; foreach my $s (@search){ #gets first two names (my $twonames=$s)=~s/^(\w+ \w+).*$/$1/; #gets all other names, if they exist (my $others=$s)=~s/^(\w+ \w+)//; if ($others){ #deletes initial space (my $alternativesearch=$others)=~s/^\s//; $alternativesearch=~s/\s/\|/g; print "$twonames\t\t$alternativesearch\n"; } else { print "$twonames\t\tNO OTHER NAMES PRESENT\n"; } } #prints # twonames alternativesearch # John Ronald Reuel|T # Trent Reznor NO OTHER NAMES PRESENT # Barack Hussein II # Barack Hussein Obama|II

In this search I would like to have an association between @search items and @source items that would yield the best match. Something like:

# search source # John Ronald Reuel T John Ronald Reuel Tolkien # Trent Reznor Trent Reznor # Barack Hussein Obama II Barack Hussein Obama II # No match here

Note that, in the case of Obama it matched the whole array, in the first line it matched the two first words plus something else, and in the last case it found nothing. How would you proceed to find the best match? (i.e. the match where the search sentence would be mostly contained in the source sentence?) Thanks

Edit: this was crossposted on StackOverflow

Edit2: Even though I used people's names in my example, my real case has no people's names in case that matters.

Replies are listed 'Best First'.
Re: Searching for best match
by Athanasius (Archbishop) on Oct 06, 2014 at 13:21 UTC

    Hello Sosi,

    In addition to Text::Fuzzy, you should look at Algorithm::Diff. Here’s some proof-of-concept code:

    #! perl use strict; use warnings; use Algorithm::Diff qw( LCS ); use Data::Dump; use constant MIN_MATCH => 0.5; my @source = ( 'John Ronald Reuel Tolkien', 'John Ronald S Tolkien', 'Trent Reznor', 'Barack Hussein Obama II', 'Barack Hussein II', ); my @search = ( 'John Ronald Reuel T', 'Trent Reznor', 'Barack Hussein II', 'Barack Hussein Obama II', 'No match here', ); my %searches = map { $_ => [] } @search; for my $s (@search) { my @search_chars = split //, $s; my @matches; for my $source (@source) { my @source_chars = split //, $source; my @diff_chars = LCS(\@search_chars, \@source_chars); my $diff = join '', @diff_chars; if (!@{ $searches{$s} } || length $diff > length $searches{$s}->[0]) { $searches{$s}->[0] = $diff; $searches{$s}->[1] = $source; } } } for my $key (keys %searches) { my $len_key = length $key; my $len_match = length $searches{$key}->[0]; delete $searches{$key} if ($len_match / $len_key) < MIN_MATCH; } dd \%searches;

    Output:

    23:19 >perl 1044_SoPW.pl { "Barack Hussein II" => ["Barack Hussein II", "Barack Hussein O +bama II"], "Barack Hussein Obama II" => ["Barack Hussein Obama II", "Barack Hus +sein Obama II"], "John Ronald Reuel T" => ["John Ronald Reuel T", "John Ronald Re +uel Tolkien"], "Trent Reznor" => ["Trent Reznor", "Trent Reznor"], } 23:19 >

    Hope that helps,

    Athanasius <°(((><contra mundum Iustus alius egestas vitae, eros Piratica,

Re: Searching for best match
by Eily (Monsignor) on Oct 06, 2014 at 12:30 UTC

    Wow, looks like perl is too clever for its own good. You feed it the string "Reuel T" and it magically translates it to "R Tolkien". Now that's DWIM taken too far!

    Beside that inconsistency between your input and the output you say you have, your exemple does not illustrate well what you meant by "best match", because in all cases the search string is fully contained in one of the sources. ("John Ronald Reuel T" and "<John Ronald Reuel T>olkien"). If your original search string was indeed "John Ronald R Tolkien", I suppose you'd still want to match "John Ronald Reuel Tolkien"?

    And FYI, you can do all this:

    #gets first two names (my $twonames=$s)=~s/^(\w+ \w+).*$/$1/; #gets all other names, if they exist (my $others=$s)=~s/^(\w+ \w+)//; #deletes initial space (my $alternativesearch=$others)=~s/^\s//;
    in one line (see "matching in list context" in Quote and Quote like Operators): my ($twonames, $others) = /^(\w+ \w+)(?: (.+))?/;

      eh I missed that upon editing my post. Thanks, I edited it accordingly.

      Regarding the "best" match, I am looking for the match that differs less. This would be the match that would contain the most characters. I am now looking into the smallest distance between sentences using Text::Fuzzy, but I'll have a look at those modules that Anonymous Monk posted above. Also, thanks for your tip on the one line assignment. I guess this just shows how "green" I am :( Thank you!

Re: Searching for best match
by Anonymous Monk on Oct 06, 2014 at 12:20 UTC
    How would you proceed to find the best match?

    That depends somewhat on your definition of "best", and some more explanation and examples would help, but I'm going to guess it's the longest one?

    for my $s (@search) { my @names = split ' ', $s; # this reqires at least first two names to match my @matches = grep { /^\Q$names[0]\E\s+\Q$names[1]\E\b/ } @source; @matches = sort {length($b)<=>length($a)} @matches; print "search='$s'\n"; print "\tfound='$_'\n" for @matches; } __END__ search='John Ronald Reuel T' found='John Ronald Reuel Tolkien' found='John Ronald S Tolkien' search='Trent Reznor' found='Trent Reznor' search='Barack Hussein II' found='Barack Hussein Obama II' found='Barack Hussein II' search='Barack Hussein Obama II' found='Barack Hussein Obama II' found='Barack Hussein II' search='No match here'

    Just a note, using \w+ to match a name may not be enough, since it might not include all the characters you would consider part of a name (for example, in ASCII it doesn't include the dot, as in "Jr." or "Sr."). That's why the code above takes the alternative approach of splitting on whitespace. However, even that might not be enough, and you should probably look into the Lingua:: namespace on CPAN. For example, a quick search brings up Lingua::EN::MatchNames and Lingua::EN::NameParse.

      One more thing: even though I used people's names in my example, my real case has no real people names (I'm working with organisms' species in case you're interested), so I can't use Lingua::. It's my fault I chose the wrong example, I'm sorry for that.

      Thank you! Yes the best match is the longest one. In the Stackoverflow post someone suggested that I looked into fuzzy-matching modules. I'm also looking into this.

        It's a little unclear to me if Text::Fuzzy does what you want, but of course investigating CPAN modules is a good idea.

        Also, just a note that the code above is only an interpretation if what your original code appears to want to do, i.e. looking at only the first two names for matches.

        A more complete selection of sample input, description of what you want the match to be, and sample output would really help, I think.

Re: Searching for best match
by choroba (Cardinal) on Oct 06, 2014 at 12:13 UTC
    Crossposted at StackOverflow. It's considered polite to inform about crossposting so people not attending both sites don't waste their time hacking a solution to a problem already solved at the other end of the Internet.
    لսႽ† ᥲᥒ⚪⟊Ⴙᘓᖇ Ꮅᘓᖇ⎱ Ⴙᥲ𝇋ƙᘓᖇ
      I'm very sorry, you are right! I'll edit my post and do this in the future
Re: Searching for best match
by Laurent_R (Canon) on Oct 06, 2014 at 22:20 UTC
    You might want to take a look at this very good book: Mastering Algorithms with Perl, by Jon Orwant, Jarkko Hietaniemi and John Macdonald (O'Reilly). Chapter 9 ("Strings") has quite a bit of things about fuzzy matching and other interesting things. The good thing about it is that it gives you not only Perl solutions (such as using the XYZ module), but also some solid CS theory behind, with important known algorithms, etc.

    It might be out of print, but I did not have too much trouble finding it second hand on the Internet a couple of years ago, or so. And it is a really "nice to read" and "nice to have" book. It definitely helped me several times when I had to tackle a subject really new to me (or long forgotten).

    You might prefer The art of Computer Programming, the bible of programming by Donald Knuth, which I am also using relatively regularly (probably being a bit of a masochistic), or the famous Cormen et al. textbook, but the good thing about Mastering Algorithms with Perl is that it is usually easier to grasp and is directly related to our (well at least my) favorite language.