Hi all,
I have a large number (~10e6) of strings (in fact, titles of articles), and I want to cluster together those that are very similar (i.e. the same title, or misspellings of the same title). Typically, the frequency of a title (i.e. the size of its final cluster) varies between 1 and a few hundred.
For a smaller set of titles, the obvious solution would be to build a matrix of the Levenshtein distances (with Text::Levenshtein) between all pairs of titles and cluster together all titles whose distance is below a cut-off (for example, 3). But in this case the number of comparisons needed (n(n-1)/2, i.e. on the order of 5e13 pairs for 10e6 titles) makes this approach infeasible.
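To make the brute-force idea concrete, here is a minimal sketch of it. Two things in it are assumptions for illustration only: the names edit_distance and cluster_titles are made up, and titles are merged transitively (single linkage), so two titles can end up in the same cluster through a chain of close neighbours. The distance is computed with a small pure-Perl routine to keep the example self-contained; the distance function from Text::Levenshtein could be used in its place.

```perl
use strict;
use warnings;
use List::Util qw(min);

# Plain dynamic-programming edit distance, keeping one row at a time.
# (Stand-in for Text::Levenshtein so the sketch has no CPAN dependency.)
sub edit_distance {
    my ($s, $t) = @_;
    my @prev = (0 .. length $t);
    for my $i (1 .. length $s) {
        my @cur = ($i);
        for my $j (1 .. length $t) {
            my $cost = substr($s, $i - 1, 1) eq substr($t, $j - 1, 1) ? 0 : 1;
            push @cur, min($prev[$j] + 1, $cur[$j - 1] + 1, $prev[$j - 1] + $cost);
        }
        @prev = @cur;
    }
    return $prev[-1];
}

# Hypothetical helper: compare every pair and merge those whose distance
# is below the cut-off, using a small union-find over the title indices.
sub cluster_titles {
    my ($titles, $cutoff) = @_;
    my @parent = (0 .. $#$titles);
    my $find = sub {
        my ($x) = @_;
        while ($parent[$x] != $x) {
            $parent[$x] = $parent[ $parent[$x] ];    # path halving
            $x = $parent[$x];
        }
        return $x;
    };
    for my $i (0 .. $#$titles) {
        for my $j ($i + 1 .. $#$titles) {            # n(n-1)/2 comparisons
            if (edit_distance($titles->[$i], $titles->[$j]) < $cutoff) {
                $parent[ $find->($i) ] = $find->($j);
            }
        }
    }
    my %groups;
    push @{ $groups{ $find->($_) } }, $titles->[$_] for 0 .. $#$titles;
    return [ values %groups ];
}

my $clusters = cluster_titles(
    [ 'Perl best practices', 'Perl best practises', 'Higher-Order Perl' ], 3);
print join(' | ', @$_), "\n" for @$clusters;
```

This is fine for a few thousand titles, but the nested loop is exactly the quadratic part that does not scale to the full set.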
Can you think of an "efficient" solution (algorithm/data structure or whatever) that can help here?
Any help would be appreciated,
citromatik
In reply to Cluster a big bunch of strings by citromatik