I'd like to point you somewhere and then offer my own swing at this.

One approach is to build a reverse (inverted) index, which maps each word to the entries that contain it. You might like to check out an old favorite article of mine, Building a Vector Space Search Engine in Perl.
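
To make the reverse index idea concrete, here is a minimal sketch. It is not taken from the article; build_index and lookup are names I made up for illustration, and it assumes that splitting on non-word characters is good enough for your data.

```perl
use strict;
use warnings;

# Build a reverse index: each lowercased word maps to the set of
# locations that contain it. (build_index is a made-up name.)
sub build_index {
    my @locations = @_;
    my %index;
    for my $loc (@locations) {
        for my $word (grep { length($_) } split /\W+/, lc($loc)) {
            $index{$word}{$loc} = 1;
        }
    }
    return \%index;
}

# Return the locations that contain the given word, sorted.
# (lookup is a made-up name; call it in list context.)
sub lookup {
    my ($index, $word) = @_;
    return sort keys %{ $index->{ lc($word) } || {} };
}

my $index = build_index(
    'Place De La Gare - Rennes',
    'Place Thiers - Nancy',
    'Place de la Gare - Quimper',
);
print "$_\n" for lookup($index, 'gare');
```

A lookup is then a single hash access per word instead of a scan over every location, which matters more as the list grows.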

Lingua::Stem::Fr may also help improve accuracy. You can also follow that article's suggestion of keeping a "bad words" (stop word) list and removing de, la, du, etc. from your dictionary.
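
Here is a rough sketch of the stop-word filtering only (stemming is left out). The word list and the name significant_words are my own assumptions, so adjust both to your data.

```perl
use strict;
use warnings;

# Illustrative stop-word list (an assumption); extend it to suit your data.
my %stop = map { $_ => 1 } qw(de la du le les des);

# Split a string into lowercased words and drop the stop words.
# (significant_words is a made-up helper name.)
sub significant_words {
    my ($text) = @_;
    return grep { length($_) && !$stop{$_} } split /\W+/, lc($text);
}

print join(' ', significant_words('Place De La Gare - Rennes')), "\n";
```

For the sample string this leaves place, gare and rennes, so the common filler words no longer contribute to a match.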

But in your guesses you seem to want phrase matching, which that approach does not directly support. There are more sophisticated algorithms, but if you want phrases I'd say brute force is best for this case: grep the list for each phrase and keep track of the hits. It is not difficult algorithmically, and for only a hundred items it will not be slow as long as you loop through the list just once per search phrase. Note that a hash key can have spaces in it, so a whole location string works fine as a key.

That said, here is my shot at it. My strategy is simple, with the added attraction of keeping score, showing only the highest scoring hits, and allowing you to search for phrases (at least it seems to work that way so far). If you want to take the search terms from the command line, take a look at @ARGV.

    #!/cygdrive/c/Perl/bin/perl
    # http://www.perlmonks.org/?node_id=447234
    use strict;
    use warnings;

    # Read the candidate locations, skipping exact repeats so that a
    # duplicated line is not scored twice.
    my (@loc, %seen);
    while (<DATA>) {
        chomp;
        push @loc, $_ unless $seen{$_}++;
    }
    #print "Available locations:\n" . join("\n", sort @loc);

    # One point for each search phrase a location contains.
    # /i makes the match case-insensitive, so no lc() is needed;
    # \Q...\E treats the phrase as literal text, not as a regex.
    my %score;
    #my @phrases = ("Place de la Gare", "Rennes");
    my @phrases = ("gare", "er", "n");
    foreach my $phrase (@phrases) {
        $score{$_}++ foreach grep { /\Q$phrase\E/i } @loc;
    }

    # Bucket the locations by score and remember the highest score seen.
    my @hits;
    my $hiscore = 0;
    foreach my $hit (keys %score) {
        my $s = $score{$hit};
        $hiscore = $s if $s > $hiscore;
        push @{ $hits[$s] }, $hit;
    }

    # just print highest scoring ones
    print "Top scoring matches with a score of $hiscore:\n";
    foreach my $toploc (@{ $hits[$hiscore] || [] }) {
        print "$toploc\n";
    }
    __DATA__
    Place De La Gare - Angers
    Place De La Gare - Nevers
    Place Mohammed V - Oujda
    Place De La Gare - Rennes
    Place de la Gare - Quimper
    Place Thiers - Nancy
    Place De La Gare - Grenoble
    Place Du Chateau - Galerie Marchande Du Rer
    Place De La Gare - Angers
    Place De La Gare 1 - Bannes Grenoble
    Place De La Gare - Nevers
    Place De La Gare - Rennes
    Place De La Gare bannes
    Place de la Gare
    Place de la Gare - Bergerac
    Place de la Gare - Moutiers
    Place de la Gare - Libourne
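
If you would rather pass the phrases on the command line than hard-code them, something along these lines should work. It is only a sketch: score_locations is a made-up name, the three locations are a cut-down sample, and the fallback phrases are just the ones from the commented-out line in my script.

```perl
use strict;
use warnings;

# Score each location by how many phrases it contains (case-insensitive,
# phrases treated as literal text). score_locations is a made-up name.
sub score_locations {
    my ($phrases, $locations) = @_;
    my %score;
    for my $phrase (@$phrases) {
        $score{$_}++ for grep { /\Q$phrase\E/i } @$locations;
    }
    return \%score;
}

# Phrases come from @ARGV; fall back to sample phrases when none are given.
my @phrases = @ARGV ? @ARGV : ('Place de la Gare', 'Rennes');
my @loc = (
    'Place De La Gare - Rennes',
    'Place Thiers - Nancy',
    'Place de la Gare - Quimper',
);

# Print every location that matched, best score first.
my $score = score_locations(\@phrases, \@loc);
for my $hit (sort { $score->{$b} <=> $score->{$a} || $a cmp $b } keys %$score) {
    print "$score->{$hit}  $hit\n";
}
```

Quote multi-word phrases in the shell so each one arrives as a single element of @ARGV, for example: perl search.pl "Place de la Gare" Rennes (search.pl being whatever you name the file).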

In reply to Re: Guessing/Ordering Partial Data by mattr
in thread Guessing/Ordering Partial Data by ropey
