I'd like to point you somewhere and then offer my own swing at this.

One approach is to build a reverse (inverted) index, which maps each word to the entries that contain it. An old favorite article of mine on this kind of thing is Building a Vector Space Search Engine in Perl.
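
To make the index idea concrete, here is a minimal sketch of a word-to-entry index. It is not the vector space engine from the article, just the simplest form of the lookup structure; the sample list and the lookup() helper are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical sample data; in practice these would be your location names.
my @docs = (
    'Place de la Gare - Rennes',
    'Place Thiers - Nancy',
    'Place de la Gare - Quimper',
);

# Build the index: lowercased word => { entry number => 1 }.
my %index;
for my $i (0 .. $#docs) {
    for my $word (map { lc $_ } $docs[$i] =~ /(\w+)/g) {
        $index{$word}{$i} = 1;
    }
}

# Return the entries containing a word, in their original order.
sub lookup {
    my ($word) = @_;
    my $found = $index{ lc $word } or return;
    return map { $docs[$_] } sort { $a <=> $b } keys %$found;
}

print "$_\n" for lookup('gare');
```

Each lookup is then a single hash access instead of a scan over every entry, which matters more as the list grows.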

Lingua::Stem::Fr may also help improve accuracy by stemming the French words. You can also follow the above article's suggestion of keeping a stop-word list and removing de, la, du, etc. from your dictionary.
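
A rough sketch of the word-list filtering, assuming a small hand-picked list; the list contents and the keywords() helper are my own illustration, not taken from the article.

```perl
use strict;
use warnings;

# Assumed list of words to ignore; extend it to suit your data.
my %stop = map { $_ => 1 } qw(de la du le les des);

# Split a string into lowercased words and drop the ignored ones.
sub keywords {
    my ($text) = @_;
    return grep { !$stop{$_} } map { lc $_ } $text =~ /(\w+)/g;
}

print join(' ', keywords('Place de la Gare - Rennes')), "\n";
```

For the sample string this leaves place, gare and rennes, so the common little words no longer count as matches.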

But your guesses suggest you want phrase matching, which that approach does not directly support. There are more sophisticated algorithms, but if you want phrases I'd say brute force, grepping and keeping track of hits, is best for this case. It is not difficult algorithmically, and for only a hundred items it will not be slow if you loop through the list only once for each word. Note that a hash key can have spaces in it.

That said, here is my shot at it. My strategy is simple, and has the added attraction of keeping score, showing only the highest-scoring hits, and allowing you to search for phrases (at least it seems to work that way so far). If you want to take the phrases from the command line, take a look at @ARGV.

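#!/cygdrive/c/Perl/bin/perl
use strict;
use warnings;

# http://www.perlmonks.org/?node_id=447234

# Read the candidate locations, one per line, from the DATA section.
# Case is left alone here; the match below is case-insensitive.
my @loc;
while (my $line = <DATA>) {
    chomp $line;
    push @loc, $line;
}

#print "Available locations:\n" . join("\n", sort @loc);

# Each location scores one point for every search phrase it contains.
my %score;
#my @phrases = ("Place de la Gare", "Rennes");
my @phrases = ("gare", "er", "n");

foreach my $phrase (@phrases) {
    # \Q...\E matches the phrase literally; /i ignores case.
    my @matches = grep { /\Q$phrase\E/i } @loc;
    foreach my $match (@matches) {
        $score{$match}++;
    }
}

# Bucket the locations by score and remember the highest score seen.
my @hits;
my $hiscore = 0;
foreach my $hit (keys %score) {
    my $s = $score{$hit};
    $hiscore = $s if $s > $hiscore;
    push @{ $hits[$s] }, $hit;
}

# just print highest scoring ones

print "Top scoring matches with a score of $hiscore:\n";
foreach my $toploc (@{ $hits[$hiscore] || [] }) {
    print "$toploc\n";
}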

__DATA__
Place De La Gare - Angers
Place De La Gare - Nevers
Place Mohammed V -  Oujda
Place De La Gare - Rennes
Place de la Gare - Quimper
Place Thiers -  Nancy
Place De La Gare -  Grenoble
Place Du Chateau - Galerie Marchande Du Rer
Place De La Gare -  Angers
Place De La Gare 1 - Bannes Grenoble
Place De La Gare -  Nevers
Place De La Gare -  Rennes
Place De La Gare bannes
Place de la Gare
Place de la Gare - Bergerac
Place de la Gare - Moutiers
Place de la Gare - Libourne
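
As for the @ARGV suggestion, one way it might look is below. This is only a sketch: the phrases_from_args() helper is hypothetical, and the fallback phrases are simply the sample ones from the script above.

```perl
use strict;
use warnings;

# Use the command-line arguments as search phrases, falling back to
# the sample phrases when none are given.
sub phrases_from_args {
    my @args = @_;
    return @args ? @args : ('gare', 'er', 'n');
}

my @phrases = phrases_from_args(@ARGV);
print "Searching for: ", join(' | ', @phrases), "\n";
```

Quote any phrase that contains spaces so the shell passes it as a single argument, for example: perl search.pl "Place de la Gare" Rennes (search.pl being whatever you name the script).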

In reply to Re: Guessing/Ordering Partial Data by mattr
in thread Guessing/Ordering Partial Data by ropey
