ropey has asked for the wisdom of the Perl Monks concerning the following question:

Hi All

I have a part of an application, where from a given (partial) user input a best guess should be made to match what the user is searching for.

In this example its in terms of locations, so the user wants to look for a given sub set of locations in say Paris, given that they have entered Paris, and optional search terms of 'Place De La Gare' and 'Rennes'. From the Paris part the application can simply find all Paris entries, lets say its the following subset of data

A simple script to *guess* from the subset is as follows

#!/usr/bin/perl -w use strict; use Data::Dumper; my @data = ( 'Place De La Gare - Angers', 'Place De La Gare - Nevers', 'Place Mohammed V - Oujda', 'Place De La Gare - Rennes', 'Place de la Gare - Quimper', 'Place Thiers - Nancy', 'Place De La Gare - Grenoble', 'Place Du Chateau - Galerie Marchande Du Rer', 'Place De La Gare - Angers', 'Place De La Gare 1 - Bannes Grenoble', 'Place De La Gare - Nevers', 'Place De La Gare - Rennes', 'Place De La Gare bannes', 'Place de la Gare', 'Place de la Gare - Bergerac', 'Place de la Gare - Moutiers', 'Place de la Gare - Libourne' ); my @guesses = ('Place de la Gare', 'Rennes'); my @list = @data; foreach my $guess(@guesses) { my @guessed = grep { /$guess/i } @list; # Now guess in reduced list @list = @guessed; } print "Guess this is what you wanted ? ", join "\n", @list;

Is there a better way, given this is a reduced subset of the real data (could be hundreds of possible choices) and multiple words to reduce the guessing ?

The next step would be to start weighting the guesses, in this case it will remove anything which is not a exact match to the terms, any *clean* ways of doing the weighting ?

Replies are listed 'Best First'.
Re: Guessing/Ordering Partial Data
by Zaxo (Archbishop) on Apr 13, 2005 at 03:48 UTC

    Here's a pretty simple way to score the matching. Convert the array of search terms to a regex alternation and count the number of matches for each datum:

    my @guesses = ('Place De La Gare', 'Rennes'); my $grei = do { # this case insensitive re added local $" = '|'; qr/@{[map {quotemeta} @guesses]}/i; }; my $gre = do { local $" = '|'; qr/@{[map {quotemeta} @guesses]}/; }; my %score; for (@data) { $score{$_} = () = m/($grei)/g; # edited. # $score{$_} += () = m/($gre)/g; # uncomment to give extra # credit for exact match } # sort keys by value or grep for threshold to pick best matches

    Update: Case insensitivity was being overruled by the compiled $gre. Repaired. Got rid of non-capture grouping and added quotemeta to defang special characters in the data. More: Added the '()=" trick to force array context - that fixes the counts.

    After Compline,
    Zaxo

      I Like Zaxo

      Output after running and using Data::Dumper is

      $VAR1 = { 'Place de la Gare - Bergerac' => 1, 'Place De La Gare - Angers' => 0, 'Place Thiers - Nancy' => 0, 'Place De La Gare - Rennes' => 1, 'Place de la Gare' => 1, 'Place De La Gare - Nevers' => 0, 'Place De La Gare - Grenoble' => 0, 'Place De La Gare 1 - Grenoble' => 0, 'Place De La Gare - Angers' => 0, 'Place de la Gare - Libourne' => 1, 'Place de la Gare - Moutiers' => 1, 'Place Mohammed V - Oujda' => 0, 'Place De La Gare' => 0, 'Place Du Chateau - Galerie Marchande Du Rer' => 0, 'Place de la Gare - Quimper' => 1, 'Place De La Gare - Nevers' => 0, 'Place De La Gare - Rennes' => 1 };

      I would have though that the Rennes match would have a score of 2, it looks to me that its case sensitive (as re-running with the input as 'Place De La Gare' instead of 'Place de la Gare' scores as i would expect. I dont however get why this is seeing as the regex is using -i ?

Re: Guessing/Ordering Partial Data
by toma (Vicar) on Apr 13, 2005 at 06:57 UTC
    This is a similar problem to Some kind of fuzzy logic, and should have a similar solution. If you go with the n-tuple approach, you get a weighting for each match that can be used to order the result.

    It should work perfectly the first time! - toma
Re: Guessing/Ordering Partial Data
by johnnywang (Priest) on Apr 13, 2005 at 06:45 UTC
    I would treat it as a typical search problem: first build a reverse index, and then just do a look up, you can order by how many are matched. One nice thing about this approach is that you can build the reverse index offline, so scalable as your @data gets bigger. (Sorry, too tired/late to give code, but it should be easy.)
Re: Guessing/Ordering Partial Data
by bageler (Hermit) on Apr 13, 2005 at 13:55 UTC
    I have a similar problem with some project management software, where I allow people to enter partial names and I do fuzzy work to pick the right one using Text::Levenshtein and Text::Soundex.
Re: Guessing/Ordering Partial Data
by mattr (Curate) on Apr 14, 2005 at 07:29 UTC
    I'd like to point you somewhere and then offer my own swing at this.

    One approach is to make a reverse index. You might like to check out an article that's an old favorite of mine on Building a Vector Space Search Engine in Perl.

    Also Lingua::Stem::Fr may help improve accuracy. Also you can use the above article's suggestion of keeping a bad words list and remove de, la, du, etc. from your dictionary.

    But in your guesses you seem to want to do phrase matching, and this is not directly supported. There are more sophisticated algorithms but if you want phrases I'd say the brute force with grepping and keeping track of hits is best for this case, it is not so difficult algorithmically and for only a hundred items it will not be slow if you only loop through once for each word. Note a hash key can have spaces in it.

    That said, here is my shot at it. My strategy was simple, and has the added attraction of keeping score, only showing the highest scoring hits, and allowing you to search for phrases. (at least it seems to work that way so far). If you want to use the command line, take a look at @ARGV.

    #!/cygdrive/c/Perl/bin/perl # http://www.perlmonks.org/?node_id=447234 my @loc = (); my $x; while (<DATA>) { lc; chomp; push (@loc,$_); } #print "Available locations:\n" . join("\n", sort @loc); my %score = (); #my @phrases = ("Place de la Gare", "Rennes"); my @phrases = ("gare","er","n"); my $phrase; foreach $phrase (@phrases) { my @matches = grep(/$phrase/i, @loc); foreach my $match (@matches) { $score{$match}++; } } my $hiscore = 0; foreach my $hit (keys %score) { my $s = $score{$hit}; $hiscore = $s if $s > $hiscore; push (@{$hits[$s]},$hit); } # just print highest scoring ones print "Top scoring matches with a score of $hiscore:\n"; foreach my $toploc (@{$hits[$hiscore]}) { print "$toploc\n"; } __DATA__ Place De La Gare - Angers Place De La Gare - Nevers Place Mohammed V - Oujda Place De La Gare - Rennes Place de la Gare - Quimper Place Thiers - Nancy Place De La Gare - Grenoble Place Du Chateau - Galerie Marchande Du Rer Place De La Gare - Angers Place De La Gare 1 - Bannes Grenoble Place De La Gare - Nevers Place De La Gare - Rennes Place De La Gare bannes Place de la Gare Place de la Gare - Bergerac Place de la Gare - Moutiers Place de la Gare - Libourne
Re: Guessing/Ordering Partial Data
by johnnywang (Priest) on Apr 13, 2005 at 17:13 UTC