icanwin has asked for the wisdom of the Perl Monks concerning the following question:

Hi all,

I have a database of about 150,000 string literals per day, and I need to group the similar ones into clusters.

I'm currently using the String::Similarity Perl module. It works fine, but I have some performance problems: with a collection of 5000 literals the system takes roughly 3 minutes.

Update: Consider I have 5000 news headlines and I want to group them by similarity. I actually want to cluster all headlines that are at least 80% similar. To speed up the process, I skip the comparison when two headlines' lengths differ by more than 20%.

My code looks like:
...
my $docsProcessed;
my $size = scalar @{$arrayDocs};

for (my $i = 0; $i < $size - 1; $i++) {
    # next if already processed
    next if defined $docsProcessed->{$arrayDocs->[$i]};

    for (my $j = $i + 1; $j < $size; $j++) {
        next if defined $docsProcessed->{$arrayDocs->[$j]};

        $similarity = similarity($arrayDocs->[$i], $arrayDocs->[$j],
                                 $self->{THRESHOLD});
        if ($similarity >= $self->{THRESHOLD}) {
            # Add the processed document into the cluster
            push @{$clusters->{$arrayDocs->[$i]}}, $arrayDocs->[$j];
            # Add the document to the processed hash
            $docsProcessed->{$arrayDocs->[$j]} = 1;
        }
    }
}
...

I just wanted to know if there is another method, other modules, or anything else to speed up this process as much as possible.

Thank you very much

Replies are listed 'Best First'.
Re: Fast string similarity method
by Limbic~Region (Chancellor) on May 29, 2007 at 16:37 UTC
    icanwin,
    I have not read your implementation at all. If you can use another module to calculate similarity, Text::Levenshtein has a 'fastdistance' function (pure Perl), and there is also Text::LevenshteinXS which uses C for speed.

    Update: As for your implementation, if you need to compare every string against every other string - the best you can hope for is N * N / 2 comparisons. On the other hand, you may be able to reduce this depending on your requirements.

    For instance, if you are interested in strings with an edit distance of less than 5, there is no need to compare strings who have a length difference of 5 or greater. To know the best way to reduce your comparisons, we will need to know more information about your input data and desired output (requirements).
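
    For what it's worth, a minimal sketch of that kind of length pre-filter combined with Text::LevenshteinXS (the threshold of 5 and the sample strings are just placeholders, not anything from the original post):

    use strict;
    use warnings;
    use Text::LevenshteinXS qw(distance);

    my $max_dist = 5;
    my @strings  = ('dog owner on trial', 'dog owner on trail', 'entirely different headline');

    for my $i (0 .. $#strings - 1) {
        for my $j ($i + 1 .. $#strings) {

            # Cheap pre-filter: if the lengths already differ by $max_dist
            # or more, the edit distance cannot be smaller than that.
            next if abs(length($strings[$i]) - length($strings[$j])) >= $max_dist;

            my $d = distance($strings[$i], $strings[$j]);
            print "'$strings[$i]' vs '$strings[$j]': edit distance $d\n"
                if $d < $max_dist;
        }
    }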

    Cheers - L~R

      I already check the string lengths in order to reduce the comparisons. I didn't post that part of the code in order to keep the example as short as possible.

      I'm currently testing the Text::LevenshteinXS approach. It is extremely fast in comparison with String::Similarity, but I still have to make some changes in order to get the same results.

      UPDATE

      I've made a mistake testing Text::LevenshteinXS. Here are the real times each method took on a 5000-string collection:

      Text::LevenshteinXS -> 2m 40s
      String::Similarity  -> 3m
        icanwin,
        I am glad Text::LevenshteinXS looks promising. You have missed my other point though.

        You have asked for help optimizing your solution but have not provided enough information to do so. There are plenty of "what if" scenarios, but you are holding all the cards. If you want more advice beyond faster edit-distance checking, please provide more information.

        Cheers - L~R

Re: Fast string similarity method
by jhourcle (Prior) on May 29, 2007 at 17:58 UTC

    I can't help with the matching itself, but for a slight adjustment, you might try benchmarking one of the following changes to your for loops:

    for ( my $i = $size; $i--; ) { ... }
    for ( my $j = $i; $j--; ) { ... }

    or

    foreach my $i ( 0 .. $size-1 ) { ... }
    foreach my $j ( $i+1 .. $size-1 ) { ... }

    or

    foreach my $i ( 0 .. $size-1 ) { ... }
    foreach my $string2 ( @{$arrayDocs}[ $i+1 .. $size-1 ] ) {
        # change references to '$arrayDocs->[$j]' to '$string2'
    }

    The first one can save a couple of operations per cycle (most likely, not significant compared to the contents of the loops, but it might shave off a second or two, and it works in other languages (see below)). The second one assumes that perl's optimization of iterating through a list of integers is faster than a 'for' loop (see below), and the last one tries to save time by reducing the number of times $arrayDocs->[$j] is referenced.

    You'd have to test the last one for yourself, as it's going to be affected by the qualities of the data (how many times you actually match).
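
    As a rough sketch of how such a test might look with the core Benchmark module (the dummy data and the trivial loop bodies are placeholders, so only the loop overhead itself is being measured):

    use strict;
    use warnings;
    use Benchmark qw(cmpthese);

    my @docs = map { "headline number $_" } 1 .. 500;   # dummy data
    my $size = @docs;

    cmpthese(-3, {
        c_style => sub {
            for (my $i = 0; $i < $size - 1; $i++) {
                for (my $j = $i + 1; $j < $size; $j++) {
                    my $len = length $docs[$j];          # stand-in for the real comparison
                }
            }
        },
        count_down => sub {
            for (my $i = $size; $i--; ) {
                for (my $j = $i; $j--; ) {
                    my $len = length $docs[$j];
                }
            }
        },
        ranges => sub {
            foreach my $i (0 .. $size - 1) {
                foreach my $j ($i + 1 .. $size - 1) {
                    my $len = length $docs[$j];
                }
            }
        },
    });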

    I know people are going to complain that I'm optimizing the wrong part, but if it shaves off a few seconds at 5k records, it should take off ~900 times that amount at 150k records.

Re: Fast string similarity method
by eric256 (Parson) on May 29, 2007 at 18:30 UTC

    How do you define clusters? It appears that you want all strings at least 80% similar to the current string. Does that mean that if you have A,B,C,D with A and B being clustered you would end up with A=>(B),B=>(A),C=>(),D=>()?? I think your above code would produce A=>(B),B=>(),C=>(),D=>() which doesn't seem right at all because then you can't tell when given B that A is close to it.

    It would seem to me that you could end up with A 80% similar to B, and B 80% similar to C, but A only 60% similar to C... so do you then get A=>(B), B=>(A,C)? I don't think your code above handles that.

    I think you need to iterate over every element every time, so that as you go you create clusters of all terms centered around the current term, then move on to the next term and scan them all again. This of course makes the solution slower, not faster, though! ;)

    My other thought would be to create a new cluster any time the current term doesn't fall inside a current cluster or falls near its edge. So say you start with A. You then compare B to the only cluster you currently have and find it is 90% similar to A, so you add it to A. Then you check C; it is 80% similar to A, so you add it to A, but it's below a threshold (say 85% for this exercise), so you also create a new cluster C. Once you have finished scanning all elements for A, you move on to the next cluster and see what elements fall inside of it. I think when you are done you would have general clusters that items could be added to. So after the initial painful additions, when you add items you are scanning, say, 50,000 clusters instead of 150,000 items, and creating new clusters as you go. Of course, balancing your thresholds for joining an existing cluster or creating your own then becomes very important. Enough rambling from me, but if you had some sample data it would be a fun diversion to mess with! ;) One last thought was to cluster around some dynamically generated centers, but I have no clue how to go about that.

    Update: It just occurred to me that you could also recenter clusters occasionally, picking one string in the cluster that better represents its center. There is probably some statistical device you could use to decide when this is needed.
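
    To illustrate the single-pass idea, here is a rough sketch (the 80% threshold, the sample strings, and the choice to key each cluster by the string that seeded it are all assumptions, and it does no recentering or edge handling; String::Similarity is reused as the measure simply because the original post uses it):

    use strict;
    use warnings;
    use String::Similarity;

    my $threshold = 0.8;
    my @docs = (
        'gm recalls pickup trucks',
        'gm recalls trucks',
        'rain expected on tuesday',
    );

    my @seeds;        # one representative string per cluster, in creation order
    my %clusters;     # seed string => ref to array of member strings

    for my $doc (@docs) {
        my ($best_seed, $best_sim) = (undef, 0);
        for my $seed (@seeds) {
            my $sim = similarity($doc, $seed, $threshold);
            ($best_seed, $best_sim) = ($seed, $sim) if $sim > $best_sim;
        }
        if (defined $best_seed && $best_sim >= $threshold) {
            push @{ $clusters{$best_seed} }, $doc;   # join the closest existing cluster
        }
        else {
            push @seeds, $doc;                       # start a new cluster seeded by this string
            $clusters{$doc} = [];
        }
    }

    print "$_ => [@{ $clusters{$_} }]\n" for @seeds;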


    ___________
    Eric Hodges
Re: Fast string similarity method
by almut (Canon) on May 30, 2007 at 02:08 UTC

    A couple of additional general thoughts.

    Beyond optimizing the similarity computations, it's probably also worth rethinking your clustering algorithm, in particular if you can't tell a priori what your clusters are (i.e. you can't name a prototypical, representative item for each cluster) but need to extract/estimate them from the data at hand.

    Unfortunately - as eric256 tried to hint at - when contemplating matters more thoroughly, even seemingly trivial clustering problems can turn out to be much more involved than most people would think at first. Also, doing it properly is often computationally prohibitively expensive for large input matrices. IOW, the intricacies of the clustering itself are often underestimated. Of course, YMMV and such, but there are many different approaches, and picking the one best suited for the purpose isn't always easy, even for the initiated.

    However, the problem domain isn't exactly new, and many books and scientific papers have been written on the subject. For example, the social sciences and biosciences are prominent areas of application, which have instigated a lot of investigations and development of related methodologies. (Having studied psychology during some past period in my life, I've inevitably had some exposure to the topic...)

    What I want to say is: in case you're more than "just playing" with the issue, and are serious about getting decent results, find yourself some introductory texts and familiarise yourself briefly with the basic clustering approaches. (Sorry, off the top of my head I can't think of anything appropriate to link to, but googling should turn up some online resources, or at least pointers to hardcopy publications.) Anyhow, it's certainly a good idea to learn from what others have already done. At least you'd be wasting less time on reinventing the wheel... :)

    For the mathematically inclined, I'd like to mention a paper which describes an efficient clustering technique for huge input matrices. The authors suggest using Singular Value Decomposition (a well-established "tool" from linear algebra) to arrive at a solution for what they call the "generalised (continuous) clustering problem" (as opposed to the more conventional "discrete clustering problem" — see the paper for the differences). More importantly, though, they show how using a random subset from the input data can help to drastically cut down on the computational burden while still obtaining a very good estimate of the final cluster solution for the full data set. As an example application, the paper briefly discusses Latent Semantic Indexing (a topic which in itself might provide alternative perspectives to look at your problem - even if only remotely related to clustering headlines...).

    Hope this helps (rather than discourages :)

    Cheers,
    Almut

    BTW, instead of trying to implement any more advanced algorithms yourself from scratch, you probably want to look into something like R, for which there is also a Perl binding (the binding is at version 0.02, and I've never used it myself, though).

Re: Fast string similarity method
by almut (Canon) on May 29, 2007 at 16:48 UTC

    Another similar module to benchmark against would be String::Approx (I haven't done any benchmarking myself, so I can't recommend one over the other).

      almut,
      I didn't mention String::Approx because of the disclaimer by the author in the POD:

      NOTE: String::Approx suits the task of string matching, not string comparison, and it works for strings, not for text.

      If you want to compare strings for similarity, you probably just want the Levenshtein edit distance (explained below),...

      String::Approx uses the Levenshtein edit distance (tLed) as its measure, but String::Approx is not well-suited for comparing the tLeds of strings, in other words, if you want a "fuzzy eq", see above. String::Approx is more like regular expressions or index(), it finds substrings that are close matches.

      Of course, having more information may reveal that this is the perfect tool for the job and that the other modules are less appropriate so thanks for bringing it up.

      Cheers - L~R

Re: Fast string similarity method
by derby (Abbot) on May 29, 2007 at 18:06 UTC

    I wonder if filtering stop words would be helpful? (especially if you can introduce some memoizing in there)

    -derby
Re: Fast string similarity method
by Random_Walk (Prior) on May 29, 2007 at 19:09 UTC

    Each string is compared to every other string, and you are comparing words. As well as removing stopwords, as already suggested, you may get a speed-up if you replace the actual words with tokens and then compare the tokenised versions of the strings.

    Of course this breaks the similarity between dog and dogs (removing all word-terminal s's before tokenising may be an OK fix) and, somewhat worse, it breaks the similarity between run and ran, and no doubt many other grammatical form shifts. Whether it helps is down to your data, really.
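
    A small sketch of that idea (the stop-word list, the trailing-'s' strip, and the two-letter tokens are all arbitrary choices here, not anything prescribed); it also folds in derby's memoising hint by caching the normalised form per string, so the per-string work is never repeated:

    use strict;
    use warnings;
    use String::Similarity;

    my %stop = map { $_ => 1 } qw(a an the of in on to and for after);   # placeholder list
    my (%token, %normalised);
    my $next_token = 'aa';

    # Map a string to a canonical, tokenised form, caching (memoising) the result.
    sub normalise {
        my ($string) = @_;
        return $normalised{$string} if exists $normalised{$string};

        my @words = grep { !$stop{$_} } split ' ', lc $string;
        s/s$// for @words;                                   # crude plural folding
        my @tokens = map { $token{$_} ||= $next_token++ } @words;

        return $normalised{$string} = join ' ', @tokens;
    }

    my $sim = similarity(normalise('The dog owners on trial'),
                         normalise('Dog owner on trial'));
    printf "similarity of normalised forms: %.2f\n", $sim;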

    Cheers,
    R.

    Pereant, qui ante nos nostra dixerunt!
Re: Fast string similarity method
by bart (Canon) on May 30, 2007 at 10:47 UTC
    I think your speed problem is a result of the fact that your method is O(N**2), as you compare every string with every other string.

    I wonder if it's not possible to use a form of "stemming", i.e. reduce the strings to a bare, basic form (the stem), and compare those instead. In the simple case where things are similar if and only if their stem is the same (think of SoundEx), you should be reducing the algorithm order to O(N). For that number of records you may then expect a program speedup of several orders of magnitude.

    Even when the case is not so simple, it might still be a helpful approach. Perhaps you can reduce the stems even more, for items that are only remotely similar, and extract a second grade similarity that way?
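
    A toy sketch of that bucketing idea, where the "stem" is simply the headline's non-stop words, lowercased and sorted (a deliberately crude, hypothetical stand-in for a real stemmer or SoundEx-style key):

    use strict;
    use warnings;

    my %stop = map { $_ => 1 } qw(a an the of in on to by and for after);

    # Hypothetical "stem" for a headline: its non-stop words, lowercased and sorted.
    sub stem {
        my ($headline) = @_;
        my @words = grep { length && !$stop{$_} } map { lc } split /\W+/, $headline;
        return join ' ', sort @words;
    }

    my @headlines = (
        'Bush nominates Zoellick for World Bank',
        'World Bank: Bush nominates Zoellick',
        'Rain expected on Tuesday',
    );

    # One pass: bucket headlines by identical stem -- O(N) instead of O(N**2).
    my %bucket;
    push @{ $bucket{ stem($_) } }, $_ for @headlines;

    for my $key (sort keys %bucket) {
        print "[$key]\n  ", join("\n  ", @{ $bucket{$key} }), "\n";
    }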

Re: Fast string similarity method
by BrowserUk (Patriarch) on May 30, 2007 at 19:49 UTC

    I've been hoping you might come back and clarify your purpose for this.

    My best guess is that you want to aggregate the headlines--for the same story but from different sources--together much in the fashion that google news does?

    If that is the case, I think that you are barking up the wrong tree in hoping that any purely character-based algorithmic mechanism for comparing strings (like String::Similarity or Text::Levenshtein or String::Approx etc.) will work.

    For example, all of the above (though I haven't been able to test String::Similarity as it won't compile here) are likely to rate the following two pseudo headlines as being very similar:

    • Dog owner on trail after incident.
    • God Rowen on trial after incident.

    Assuming case insensitive comparisons, they are both the same length; contain the same characters; and the few transpositions required to turn one into the other occur in close proximity. But their meanings are entirely different. As any human reader will spot immediately, they are very unlikely to relate to the same story.

    Another example. Take a look at some of the 1,714 headlines relating to Google news' current top story. Whilst there are definite points of commonality between the great majority of them--'Bush', 'Zoellick', 'World', 'Bank'--, there are also those which have no discernible commonality at all. Eg. The Economist headlines the story with "Right second time".

    Even amongst those that do have several keywords in common, the placement/relationship of those keywords relative to the other words making up the headlines are such that any algorithm that treats the comparison as simple strings of characters is unlikely to do a good job at grouping them.

    To be in any way successful, it becomes necessary to break the strings into words and then compare them in terms of the similarity of the words contained rather than their relative placement. Of course, as others have mentioned, there are a whole bunch of common words (stop words) whose presence in a headline will do nothing useful in terms of grouping. They will be so common as to tend to create groupings (increase similarity) where none exist.

    And that leads to my suggestion, assuming that this is what you are attempting to do. You need to group the headlines according to commonality of their least common words.

    I would break the headlines into words and accumulate an occurrence count for each word across the entire corpus of the day's data (and probably accumulate these counts going forward). I would then 'tag' each headline using the two or three or four least common words it contains.

    Headlines would then be clustered according to the number of least common words they have in common.

    For example. Today, any headline containing the word 'Zoellick' is quite likely to relate to the new president of the World Bank. If it also contains any of the words 'bush', 'president', 'world' or 'bank' it is very likely to. And if it contains more than two, even more likely.

    Of course, that won't help with maverick or cutesy (like the Economist's effort above), or punning headlines (as favoured by the tabloid press), but short of performing a similar 'least common words in common' process on the content of the article, there is little that can be done about that.

    The good news is that if you went with this algorithm, you ought to be able to improve your performance markedly, as it is an O( 2n ) process rather than an O( m*n ) process.
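
    A rough sketch of the counting-and-tagging part (the stop-word list, the choice of three tags per headline, and the tiny sample corpus are all placeholders; with only a handful of headlines the counts are not very meaningful, but the mechanics are the same):

    use strict;
    use warnings;

    my @headlines = (
        'Bush nominates Zoellick to head World Bank',
        'Zoellick named World Bank president by Bush',
        'Dog owner on trial after incident',
    );
    my %stop = map { $_ => 1 } qw(a an the of in on to by and after);

    # Pass 1: occurrence count for every non-stop word across the corpus.
    my (%count, %words_of);
    for my $h (@headlines) {
        my @w = grep { length && !$stop{$_} } map { lc } split /\W+/, $h;
        $words_of{$h} = \@w;
        $count{$_}++ for @w;
    }

    # Pass 2: tag each headline with its (up to) three least common words and
    # build an inverted index: tag word => headlines carrying that tag.
    my %by_tag;
    for my $h (@headlines) {
        my @tags = (sort { $count{$a} <=> $count{$b} } @{ $words_of{$h} })[0 .. 2];
        push @{ $by_tag{$_} }, $h for grep { defined } @tags;
    }

    # In a real corpus, headlines about the same story tend to share rare tags
    # (e.g. 'zoellick'); counting shared tags via this index gives the grouping.
    for my $tag (sort keys %by_tag) {
        print "[$tag]\n  ", join("\n  ", @{ $by_tag{$tag} }), "\n";
    }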

    Anyway, maybe the thoughts will help. Good luck.


    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
Re: Fast string similarity method
by Moron (Curate) on May 31, 2007 at 12:36 UTC
    The words "two-dimensional hashing algorithm" floated across my mind as to how to avoid the cross-product of iterations you have there. That is to say, map individual strings onto a simpler two-dimensional space instead where proximity is proportional to similarity.

    The obvious idea of using R^2 as the range for the hashing function doesn't inspire anything obvious, because the similarity function clearly isn't transitive across three strings, as someone already freaked out about; therefore either more than two dimensions are needed, or a more esoteric space (a little googling against "similarity hashing" suggests that both have been applied, i.e. fractal geometries and multi-dimensional metric spaces are in the mix). In the similar problem of clustering species by similarity of genes, one approach was to use dissimilarity to construct a hierarchy of categories and have the actual data "find each other's similarity" further down the tree (see http://math.berkeley.edu/~datta/stanfordtalk.pdf). All of the work is recent (less than a decade old) and the most promising has been applied to "wikipedia in the pocket", based on this article. Wikipedia itself only mentions the algorithm you already have, a.k.a. the Rabin-Karp algorithm, which looks rather clunky compared to the more whizz-bang-maths efforts to pre-process a single string into its correct location in a metric space (do leave us a module on your way out ;))

    __________________________________________________________________________________

    ^M Free your mind!

Re: Fast string similarity method
by neosamuri (Friar) on May 30, 2007 at 19:02 UTC

    If you define a cluster by a string to which all others in the cluster are at least 80% similar, then instead of comparing each string to all others, only compare against the cluster string (the one which defines the cluster). There are then two cases to look at.

    If the string would fit into the 80% mark for more than one cluster, then I would suggest putting it in the one to which it is most similar.

    The second case is the harder one. If it does not pass the 80% mark for any cluster, then you want to create a cluster that it will fit into. The simplest way to do that is to just use it as the base for the new cluster. The problem is that you may not end up with the most dissimilar clusters, which is probably what you want. (Note that when the most dissimilar cluster bases are found you will have the smallest number of clusters.)

    To solve that problem you would want to try and find a set of the most dissimilar strings in the set. The method for finding the most distant strings would depend on the method for finding the similarity between the strings.

    Of course, just using the method above will give some performance gain (depending on how many clusters are in the set).