Guessing/Ordering Partial Data

ropey has asked for the wisdom of the Perl Monks concerning the following question:

Hi All

I have a part of an application, where from a given (partial) user input a best guess should be made to match what the user is searching for.

In this example its in terms of locations, so the user wants to look for a given sub set of locations in say Paris, given that they have entered Paris, and optional search terms of 'Place De La Gare' and 'Rennes'. From the Paris part the application can simply find all Paris entries, lets say its the following subset of data

Place De La Gare - Angers
Place De La Gare - Nevers
Place Mohammed V - Oujda
Place De La Gare - Rennes
Place de la Gare - Quimper
Place Thiers - Nancy
Place De La Gare - Grenoble
Place Du Chateau - Galerie Marchande Du Rer
Place De La Gare - Angers
Place De La Gare 1 - Grenoble
Place De La Gare - Nevers
Place De La Gare - Rennes
Place De La Gare
Place de la Gare
Place de la Gare - Bergerac
Place de la Gare - Moutiers
Place de la Gare - Libourne

A simple script to *guess* from the subset is as follows

#!/usr/bin/perl -w

use strict;
use Data::Dumper;

my @data = (
'Place De La Gare - Angers',
'Place De La Gare - Nevers',
'Place Mohammed V -  Oujda',
'Place De La Gare - Rennes',
'Place de la Gare - Quimper',
'Place Thiers -  Nancy',
'Place De La Gare -  Grenoble',
'Place Du Chateau - Galerie Marchande Du Rer',
'Place De La Gare -  Angers',
'Place De La Gare 1 - Bannes Grenoble',
'Place De La Gare -  Nevers',
'Place De La Gare -  Rennes',
'Place De La Gare bannes',
'Place de la Gare',
'Place de la Gare - Bergerac',
'Place de la Gare - Moutiers',
'Place de la Gare - Libourne'
        );

my @guesses = ('Place de la Gare', 'Rennes');

my @list = @data;
foreach my $guess(@guesses) {
    my @guessed = grep { /$guess/i } @list;
    # Now guess in reduced list
    @list = @guessed;
}
print "Guess this is what you wanted ? ", join "\n", @list;
[download]

Is there a better way, given this is a reduced subset of the real data (could be hundreds of possible choices) and multiple words to reduce the guessing ?

The next step would be to start weighting the guesses, in this case it will remove anything which is not a exact match to the terms, any *clean* ways of doing the weighting ?

Comment on Guessing/Ordering Partial Data Download Code

Replies are listed 'Best First'.
Re: Guessing/Ordering Partial Data by Zaxo (Archbishop) on Apr 13, 2005 at 03:48 UTC
Here's a pretty simple way to score the matching. Convert the array of search terms to a regex alternation and count the number of matches for each datum: `my @guesses = ('Place De La Gare', 'Rennes'); my $grei = do { # this case insensitive re added local $" = '\|'; qr/@{[map {quotemeta} @guesses]}/i; }; my $gre = do { local $" = '\|'; qr/@{[map {quotemeta} @guesses]}/; }; my %score; for (@data) { $score{$_} = () = m/($grei)/g; # edited. # $score{$_} += () = m/($gre)/g; # uncomment to give extra # credit for exact match } # sort keys by value or grep for threshold to pick best matches` [download] Update: Case insensitivity was being overruled by the compiled $gre. Repaired. Got rid of non-capture grouping and added quotemeta to defang special characters in the data. More: Added the '()=" trick to force array context - that fixes the counts. After Compline, Zaxo	[reply] [d/l]
Re^2: Guessing/Ordering Partial Data by ropey (Hermit) on Apr 13, 2005 at 04:11 UTC
I Like Zaxo Output after running and using Data::Dumper is $VAR1 = { 'Place de la Gare - Bergerac' => 1, 'Place De La Gare - Angers' => 0, 'Place Thiers - Nancy' => 0, 'Place De La Gare - Rennes' => 1, 'Place de la Gare' => 1, 'Place De La Gare - Nevers' => 0, 'Place De La Gare - Grenoble' => 0, 'Place De La Gare 1 - Grenoble' => 0, 'Place De La Gare - Angers' => 0, 'Place de la Gare - Libourne' => 1, 'Place de la Gare - Moutiers' => 1, 'Place Mohammed V - Oujda' => 0, 'Place De La Gare' => 0, 'Place Du Chateau - Galerie Marchande Du Rer' => 0, 'Place de la Gare - Quimper' => 1, 'Place De La Gare - Nevers' => 0, 'Place De La Gare - Rennes' => 1 }; [download] I would have though that the Rennes match would have a score of 2, it looks to me that its case sensitive (as re-running with the input as 'Place De La Gare' instead of 'Place de la Gare' scores as i would expect. I dont however get why this is seeing as the regex is using -i ?	[reply] [d/l]
Re: Guessing/Ordering Partial Data by toma (Vicar) on Apr 13, 2005 at 06:57 UTC
This is a similar problem to Some kind of fuzzy logic, and should have a similar solution. If you go with the n-tuple approach, you get a weighting for each match that can be used to order the result. It should work perfectly the first time! - toma	[reply]
Re: Guessing/Ordering Partial Data by johnnywang (Priest) on Apr 13, 2005 at 06:45 UTC
I would treat it as a typical search problem: first build a reverse index, and then just do a look up, you can order by how many are matched. One nice thing about this approach is that you can build the reverse index offline, so scalable as your @data gets bigger. (Sorry, too tired/late to give code, but it should be easy.)	[reply]
Re: Guessing/Ordering Partial Data by bageler (Hermit) on Apr 13, 2005 at 13:55 UTC
I have a similar problem with some project management software, where I allow people to enter partial names and I do fuzzy work to pick the right one using Text::Levenshtein and Text::Soundex.	[reply]
Re: Guessing/Ordering Partial Data by mattr (Curate) on Apr 14, 2005 at 07:29 UTC
I'd like to point you somewhere and then offer my own swing at this. One approach is to make a reverse index. You might like to check out an article that's an old favorite of mine on Building a Vector Space Search Engine in Perl. Also Lingua::Stem::Fr may help improve accuracy. Also you can use the above article's suggestion of keeping a bad words list and remove de, la, du, etc. from your dictionary. But in your guesses you seem to want to do phrase matching, and this is not directly supported. There are more sophisticated algorithms but if you want phrases I'd say the brute force with grepping and keeping track of hits is best for this case, it is not so difficult algorithmically and for only a hundred items it will not be slow if you only loop through once for each word. Note a hash key can have spaces in it. That said, here is my shot at it. My strategy was simple, and has the added attraction of keeping score, only showing the highest scoring hits, and allowing you to search for phrases. (at least it seems to work that way so far). If you want to use the command line, take a look at @ARGV. #!/cygdrive/c/Perl/bin/perl # http://www.perlmonks.org/?node_id=447234 my @loc = (); my $x; while (<DATA>) { lc; chomp; push (@loc,$_); } #print "Available locations:\n" . join("\n", sort @loc); my %score = (); #my @phrases = ("Place de la Gare", "Rennes"); my @phrases = ("gare","er","n"); my $phrase; foreach $phrase (@phrases) { my @matches = grep(/$phrase/i, @loc); foreach my $match (@matches) { $score{$match}++; } } my $hiscore = 0; foreach my $hit (keys %score) { my $s = $score{$hit}; $hiscore = $s if $s > $hiscore; push (@{$hits[$s]},$hit); } # just print highest scoring ones print "Top scoring matches with a score of $hiscore:\n"; foreach my $toploc (@{$hits[$hiscore]}) { print "$toploc\n"; } __DATA__ Place De La Gare - Angers Place De La Gare - Nevers Place Mohammed V - Oujda Place De La Gare - Rennes Place de la Gare - Quimper Place Thiers - Nancy Place De La Gare - Grenoble Place Du Chateau - Galerie Marchande Du Rer Place De La Gare - Angers Place De La Gare 1 - Bannes Grenoble Place De La Gare - Nevers Place De La Gare - Rennes Place De La Gare bannes Place de la Gare Place de la Gare - Bergerac Place de la Gare - Moutiers Place de la Gare - Libourne [download]	[reply] [d/l]
Re: Guessing/Ordering Partial Data by johnnywang (Priest) on Apr 13, 2005 at 17:13 UTC
Another approach is something similar to Google Suggests http://www.google.com/webhp?complete=1&hl=en, it gives the user immediate feedback, but I'm not sure how fast your server has to be.	[reply]