Although you start out with a large number of strings, you can certainly reduce the number of required comparisons by precomputing some cheap metrics. For example, checking for exact equivalence is fairly cheap, so duplicates can be collapsed first. Also, assuming significant variability in title length, string length makes a good filter: the Levenshtein distance between two strings is never smaller than the difference in their lengths, so if you pick a fixed threshold of a Levenshtein distance of 3, any two strings whose lengths differ by 4 or more can be dropped immediately.
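As a rough sketch of the two prefilters above (in Python rather than Perl, purely for illustration): the function names `levenshtein` and `close_pairs` and the `max_dist` parameter are my own, not anything from the thread. Exact duplicates are collapsed with a set, and sorting by length lets the inner loop stop as soon as the length gap exceeds the threshold.

```python
def levenshtein(a, b):
    """Plain dynamic-programming edit distance, keeping two rows."""
    if len(a) < len(b):
        a, b = b, a
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


def close_pairs(titles, max_dist=3):
    """Return pairs of distinct titles within max_dist edits of each other."""
    # set() removes exact duplicates; sorting by length groups candidates.
    unique = sorted(set(titles), key=lambda t: (len(t), t))
    pairs = []
    for i, a in enumerate(unique):
        for b in unique[i + 1:]:
            # Distance >= length difference, so everything after this
            # point in the sorted list is too long to match: stop early.
            if len(b) - len(a) > max_dist:
                break
            if levenshtein(a, b) <= max_dist:
                pairs.append((a, b))
    return pairs
```

How much this saves depends entirely on how spread out the title lengths are; if most titles have similar lengths, the filter prunes very little.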

I'm a little surprised that 1 trillion comparisons is computationally infeasible. Are you trying to do this dynamically? If so, your best bet would seem to be caching the results in a DB. That reduces the problem to one large initial data crunch, followed by a much smaller operation for each new addition, since a new title only has to be compared against the titles already stored.
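One way the caching idea might look, as a minimal sketch using SQLite from Python. The schema (tables `titles` and `close_pairs`) and the helpers `open_cache` and `add_title` are hypothetical names invented for this illustration; the original poster's storage layout is unknown.

```python
import sqlite3


def levenshtein(a, b):
    """Plain dynamic-programming edit distance, keeping two rows."""
    if len(a) < len(b):
        a, b = b, a
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1,
                           prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def open_cache(path=":memory:"):
    """Open (or create) the cache. The schema is an assumption of this sketch."""
    conn = sqlite3.connect(path)
    conn.execute("CREATE TABLE IF NOT EXISTS titles (title TEXT PRIMARY KEY)")
    conn.execute("CREATE TABLE IF NOT EXISTS close_pairs ("
                 "a TEXT, b TEXT, dist INTEGER, PRIMARY KEY (a, b))")
    return conn


def add_title(conn, title, max_dist=3):
    """Compare one new title against the stored ones only, then store it."""
    seen = conn.execute("SELECT 1 FROM titles WHERE title = ?", (title,))
    if seen.fetchone():
        return  # already processed, nothing to recompute
    # Only fetch stored titles whose length is within max_dist of the new one.
    rows = conn.execute(
        "SELECT title FROM titles WHERE length(title) BETWEEN ? AND ?",
        (len(title) - max_dist, len(title) + max_dist)).fetchall()
    for (other,) in rows:
        dist = levenshtein(title, other)
        if dist <= max_dist:
            a, b = sorted((title, other))  # store each pair in one fixed order
            conn.execute("INSERT OR REPLACE INTO close_pairs VALUES (?, ?, ?)",
                         (a, b, dist))
    conn.execute("INSERT INTO titles VALUES (?)", (title,))
    conn.commit()
```

The initial crunch is then just a loop calling `add_title` over the existing data, and later insertions reuse the same function at a cost proportional to the number of stored titles of similar length.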

My last thought is that the expected number of typographical errors grows roughly in proportion to string length, so you may want to use a relative distance (the edit distance scaled by string length) rather than an absolute one.
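A relative distance could be as simple as the edit distance divided by the length of the longer string. The sketch below assumes that normalization and a cutoff of 0.1, both of which are arbitrary choices of mine; `relative_distance` and `is_close` are hypothetical names, and `dist` is whatever edit distance you have already computed.

```python
def relative_distance(dist, a, b):
    """Scale an already computed edit distance by the longer string's length."""
    longest = max(len(a), len(b))
    return dist / longest if longest else 0.0  # two empty strings: distance 0


def is_close(dist, a, b, max_ratio=0.1):
    """True if the edits amount to at most max_ratio of the longer string."""
    return relative_distance(dist, a, b) <= max_ratio
```

With this cutoff, 3 edits in a 60-character title count as a match (ratio 0.05), while 3 edits in a 6-character title do not (ratio 0.5). Note that a relative threshold also changes the length prefilter: the allowed length difference then depends on the length of the longer string instead of being a fixed 3.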


In reply to Re: Cluster a big bunch of strings by kennethk
in thread Cluster a big bunch of strings by citromatik
