I'm looking to do fuzzy matching on text, and I'm struggling to come up with a sensible solution.

I'm always comparing two strings, so Levenshtein distance is a possibility; however, as the String::Approx documentation points out:
"If you want to compare things like text or source code, consisting of words or tokens and phrases and sentences, or expressions and statements, you should probably use some other tool than String::Approx"

Here are some examples of text:
These two obviously match: they'd have a distance of 6, or 14% different
  • Aberdeen University Research Archive: AURA
  • Aberdeen University Research Archive
These two also obviously match; however, they'd have a distance of 9, or 37% different
  • Archivio Marini
  • Archivio Giuliano Marini
However, these two probably shouldn't match, even though they'd have a distance of only 4 (CCLRC to STFC is three substitutions and one deletion), or about 15% different
  • CCLRC ePublication Archive
  • STFC ePublication Archive
But what on earth does one do here? A distance of 39, which is over 50% different!
  • arXiv.org e-Print archive (physics, mathematics, related fields)
  • arXiv.org e-Print archive
... and I just despair over this one:
  • Cracow University of Technology Digital Library
  • Biblioteka Cyfrowa Politechniki Krakowskiej (Digital Library of Cracow University of Technology)
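
For anyone who wants to check figures like the ones above, here is a minimal sketch of the calculation. It assumes the percentage is the plain Levenshtein distance divided by the length of the longer string, which is my guess at a sensible normalisation rather than anything String::Approx prescribes. It is written in Python purely for illustration, and the function names `levenshtein` and `percent_different` are made up for this example.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance counting single-character inserts, deletes and substitutions."""
    # Keep the shorter string in the inner loop so each row stays small.
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute (or keep) the character
            ))
        previous = current
    return previous[-1]


def percent_different(a: str, b: str) -> float:
    """Distance as a percentage of the longer string (assumed normalisation)."""
    longest = max(len(a), len(b))
    return 100.0 * levenshtein(a, b) / longest if longest else 0.0


pairs = [
    ("Aberdeen University Research Archive: AURA",
     "Aberdeen University Research Archive"),
    ("Archivio Marini", "Archivio Giuliano Marini"),
    ("CCLRC ePublication Archive", "STFC ePublication Archive"),
    ("arXiv.org e-Print archive (physics, mathematics, related fields)",
     "arXiv.org e-Print archive"),
]

for left, right in pairs:
    print(levenshtein(left, right), f"{percent_different(left, right):.1f}%",
          left, "|", right)
```

The sketch only measures character edits, which is exactly why it struggles with these names: whole added or dropped words cost as much as their length, while a swapped acronym costs almost nothing.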

Thoughts, opinions, suggestions gratefully sought....

(I don't think Text::Soundex will help either... :sad:)



-- Ian Stuart
A man depriving some poor village, somewhere, of a first-class idiot.

In reply to Fuzzy text matching... again by kiz
