The words "two-dimensional hashing algorithm" floated across my mind as to how to avoid the cross-product of iterations you have there. That is to say, map individual strings onto a simpler two-dimensional space instead where proximity is proportional to similarity.

The obvious idea of using R^^2 as the range for the hashing function, doesn't inspire anything obvious because the similarity function clearly isn't associative across three strings as someone already freaked out about, therefore either more than two dimensions are needed, or a more esoteric space (a little googling against "similarity hashing" suggests that BOTH have been applied ie. fractal geometries and multi-dimensional metric spaces in the mix). In the similar problem of clustering species by similarity of genes, one approach was to use dissimilarity to construct a hierarchy of categories and have the actual data "find each other's similarity" further down the tree (see http://math.berkeley.edu/~datta/stanfordtalk.pdf). All of the work is recent (less than a decade old) and the most promising has been applied to "wikipedia in the pocket", based on this article. wikipedia itself only mentions the algorithm you already have, a.k.a the Rabin-Karp algorithm, which looks rather clunky compared to the more whizz-bang-maths efforts to pre-process a single string into its correct location in a metric space (do leave us a module on your way out ;))

__________________________________________________________________________________

^M Free your mind!


In reply to Re: Fast string similarity method by Moron
in thread Fast string similarity method by icanwin

Title:
Use:  <p> text here (a paragraph) </p>
and:  <code> code here </code>
to format your post, it's "PerlMonks-approved HTML":



  • Posts are HTML formatted. Put <p> </p> tags around your paragraphs. Put <code> </code> tags around your code and data!
  • Titles consisting of a single word are discouraged, and in most cases are disallowed outright.
  • Read Where should I post X? if you're not absolutely sure you're posting in the right place.
  • Please read these before you post! —
  • Posts may use any of the Perl Monks Approved HTML tags:
    a, abbr, b, big, blockquote, br, caption, center, col, colgroup, dd, del, details, div, dl, dt, em, font, h1, h2, h3, h4, h5, h6, hr, i, ins, li, ol, p, pre, readmore, small, span, spoiler, strike, strong, sub, summary, sup, table, tbody, td, tfoot, th, thead, tr, tt, u, ul, wbr
  • You may need to use entities for some characters, as follows. (Exception: Within code tags, you can put the characters literally.)
            For:     Use:
    & &amp;
    < &lt;
    > &gt;
    [ &#91;
    ] &#93;
  • Link using PerlMonks shortcuts! What shortcuts can I use for linking?
  • See Writeup Formatting Tips and other pages linked from there for more info.