citromatik has asked for the wisdom of the Perl Monks concerning the following question:

Hi all,

I have a big number (~10^6) of strings (in fact, titles of articles), and I want to cluster together those that are very similar (i.e. those that are the same title, or misspellings of the same title). Typically, the frequency of titles (i.e. the size of the final clusters) varies between 1 and a few hundred.

Maybe the obvious solution for smaller sets of "titles" would be to build a matrix of the Levenshtein distances (with Text::Levenshtein) between all the title pairs and cluster together all the titles with a distance lower than a cut-off (for example, 3). But in this case, the number of comparisons needed makes this approach unfeasible.
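To make the scale concrete, this is roughly the naive approach I have in mind (just a sketch; the greedy single-link assignment and the cut-off of 3 are only for illustration):

    use strict;
    use warnings;
    use Text::Levenshtein qw(distance);

    my @titles = <STDIN>;
    chomp @titles;

    my $cutoff = 3;     # example cut-off from above
    my (@clusters, @cluster_of);

    # All-against-all: O(N^2) Levenshtein computations, which is exactly
    # what becomes unfeasible for ~10^6 titles.
    TITLE: for my $i (0 .. $#titles) {
        for my $j (0 .. $i - 1) {
            if (distance($titles[$i], $titles[$j]) < $cutoff) {
                push @{ $clusters[ $cluster_of[$j] ] }, $titles[$i];
                $cluster_of[$i] = $cluster_of[$j];
                next TITLE;
            }
        }
        push @clusters, [ $titles[$i] ];    # start a new cluster
        $cluster_of[$i] = $#clusters;
    }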

Can you think of an "efficient" solution (algorithm/data structure or whatever) that can help here?

Any help would be appreciated,

citromatik

Re: Cluster a big bunch of strings
by ikegami (Patriarch) on Mar 30, 2009 at 16:30 UTC

      :=O !!

      “Ooooh! Aaahhh!”

      Just when you think you're having a pretty-ordinary Monday morning, some bona fide rocket-scientist comes along ...

Re: Cluster a big bunch of strings
by Limbic~Region (Chancellor) on Mar 30, 2009 at 18:03 UTC
    citromatik,
    How often are these titles changing? How important is it to get "this article belongs to this cluster" right? How "tight" are the clusters - these articles all talk about social networking sites vs these articles all have to do with lawsuits against Facebook? I have lots of ideas but it all depends on knowing more about your goals.

    See for instance Efficient Fuzzy Matching Of An Address. In working through that problem I ran down a number of dead ends which might be just right for you.

    Cheers - L~R

      How often are these titles changing?

      More titles are added each day, but I re-analyze the data once a month.

      How important is it to get "this article belongs to this cluster" right?

      It is important; every failure in the process is a bug in the system.

      How "tight" are the clusters - these articles all talk about social networking sites vs these articles all have to do with lawsuits against Facebook?

      All are articles describing the analysis of metagenomic samples (samples that contain many bacterial species), i.e. where the samples were collected (geographical location), how the sample was treated, what results they obtained...

      A couple of title examples:

      1 "Comparative analysis of bacterial communities passing through the g +uts of three earthworm species with rRNA-based techniques" 2 "Prokaryotic Diversity of An Inland Salt Habitat Investigated by usi +ng Two Different Molecular Approaches" 3 "High diversity of fungi recovered from the roots of mature tanoak ( +Lithocarpus densiflorus) in northern California"

      Thanks for your interest and help

      citromatik

        citromatik,
        You really didn't answer my last question, which means I probably didn't ask it well. Is your goal only to identify duplicate articles (typos and what not), or is it to group related articles as well? If it is to group related articles, what criteria are you using? I gave an example relating to social networking sites. A "loose" group might be anything having to do with a social networking site, where a "tighter" group might only have to do with lawsuits against Facebook.

        What do you intend to do if an article can belong to more than one cluster? You mentioned that it would be considered a bug if an article ended up in the wrong cluster - how would you know? Is there a subjective element to this task, or can you define the criteria for clustering in black-and-white terms? In other words, if you randomly selected 1000 articles and clustered them by hand, could you write code that would produce the same results (where computational time is not a factor)?

        Cheers - L~R

Re: Cluster a big bunch of strings
by kennethk (Abbot) on Mar 30, 2009 at 16:45 UTC

    While you start out with a large number of strings, you could certainly reduce the number of required comparisons by precomputing some cheap metrics. For example, checking for equality is fairly cheap. Also, assuming significant variability in title length, if you pick a fixed cut-off of a Levenshtein distance of 3, any two strings whose lengths differ by 4 or more can be dropped immediately, since the length difference is a lower bound on the edit distance.
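    A pre-filter along those lines might look something like this (just a sketch; the cut-off of 3 follows the example above):

        use strict;
        use warnings;
        use Text::Levenshtein qw(distance);

        my $cutoff = 3;

        # Cheap checks first; only pay for a full Levenshtein computation
        # when the pair survives them.
        sub maybe_similar {
            my ($s, $t) = @_;
            return 1 if $s eq $t;                                # equality is cheap
            return 0 if abs(length($s) - length($t)) > $cutoff;  # length gap bounds the distance
            return distance($s, $t) <= $cutoff;
        }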

    I'm a little surprised that 1 trillion comparisons is computationally infeasible. Are you trying to do this dynamically? If so, it would seem your best bet would be caching results in a DB, thus reducing the problem to one large initial data crunch followed by a much smaller insertion operation for new additions.

    My last thought is that probability of a typographical error is proportional to string length, so you may want to use a relative distance rather than an absolute one.
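    One way to express that idea (purely a sketch; the 10% threshold at the end is an arbitrary choice):

        use strict;
        use warnings;
        use List::Util qw(max);
        use Text::Levenshtein qw(distance);

        # Relative edit distance: edits as a fraction of the longer string,
        # so three typos in a 120-character title count for less than three
        # typos in a 20-character one.
        sub relative_distance {
            my ($s, $t) = @_;
            my $len = max(length $s, length $t);
            return 0 if $len == 0;
            return distance($s, $t) / $len;
        }

        my ($title_a, $title_b) = @ARGV;
        print "similar\n" if relative_distance($title_a, $title_b) <= 0.10;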

      I'm a little surprised that 1 trillion comparisons is computationally infeasible.
      Why? 1GHz is one billion cycles per second, so 1 trillion CPU cycles take on the order of 15 minutes. Given that a string comparison with multiple function calls will take at least hundreds or thousands of CPU cycles, this will take at least a day or so. Probably a lot more.

      If you want to do fast, approximate comparison of strings, a lot of work has been done on that in bioinformatics. In particular, the popular BLAST and PSI-BLAST algorithms more or less solve this problem by creating an index of short substrings, then doing a full comparison where they match. You might try to develop something similar.
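      A very rough sketch of that idea (the k-mer length and the "shared k-mers" threshold are guesses that would need tuning against real titles):

          use strict;
          use warnings;
          use Text::Levenshtein qw(distance);

          my $k      = 6;   # substring ("k-mer") length - tune for your data
          my $cutoff = 3;

          my @titles = <STDIN>;
          chomp @titles;

          # Index which titles contain each k-mer.
          my %index;
          for my $i (0 .. $#titles) {
              my $t = lc $titles[$i];
              my %seen;
              for my $pos (0 .. length($t) - $k) {
                  my $kmer = substr $t, $pos, $k;
                  push @{ $index{$kmer} }, $i unless $seen{$kmer}++;
              }
          }

          # Count shared k-mers per pair and only run the expensive
          # Levenshtein on pairs that have enough substrings in common.
          for my $i (0 .. $#titles) {
              my $t = lc $titles[$i];
              my (%shared, %seen);
              for my $pos (0 .. length($t) - $k) {
                  my $kmer = substr $t, $pos, $k;
                  next if $seen{$kmer}++;
                  $shared{$_}++ for @{ $index{$kmer} };
              }
              for my $j (grep { $_ > $i } keys %shared) {
                  next if $shared{$j} < 5;   # "enough in common" - arbitrary
                  print "$titles[$i]\t$titles[$j]\n"
                      if distance($titles[$i], $titles[$j]) <= $cutoff;
              }
          }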

        I come from a scientific computing background, so letting a machine chew on some code for a month is not computationally infeasible to me. Obviously this is unacceptable if this is to be done on the fly (or if the deadline is next week), but doing this in a one-off fashion and storing the results doesn't seem out there to me.

        To get some scale, the following code randomly generates a set of strings and then runs them through the equality operator. The strings vary in length and content but, to reduce early exits, use only the letters a and b. It takes around 60 seconds to run as posted. Assuming N^2 scaling, this would take just short of 6 days on my laptop (2.4 GHz). Of course, given the size of the problem, there'll be some slowdown due to memory access, but there's certainly plenty of room for optimization in what I posted.
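        Something along these lines (the string count and lengths here are illustrative, so the timings will differ):

            use strict;
            use warnings;

            # Random strings over {a, b} with varying lengths; a two-letter
            # alphabet reduces early exits in the eq comparisons.
            my $n = 20_000;                     # illustrative count
            my @strings;
            for (1 .. $n) {
                my $len = 50 + int rand 51;     # lengths 50..100
                push @strings, join '', map { rand() < 0.5 ? 'a' : 'b' } 1 .. $len;
            }

            # All-pairs equality check.
            my $equal = 0;
            for my $i (0 .. $#strings) {
                for my $j ($i + 1 .. $#strings) {
                    $equal++ if $strings[$i] eq $strings[$j];
                }
            }
            print "$equal equal pairs among $n strings\n";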

        Update: For those who are wondering, the worst-case scenario for the above (all strings given by 'a' x 100) takes 75 seconds on my machine. And a pass that only compares lengths takes roughly comparable time in all cases.

Re: Cluster a big bunch of strings
by hawtin (Prior) on Mar 31, 2009 at 12:55 UTC

    I had a similar issue when I set up a web site that consolidates music charts from around the world. In my case I wanted to match names of artists and titles of songs, the source data tended to have challenges ranging from the simple ("Beatles" v "The Beatles" and "Gary Crosby" v "Gary Crosby and his Orchestra") to the really quite complex ("Paul McCartney" v "Wings").

    I have used a modified Levenshtein distance algorithm on strings that have been Soundex-ed, which was then tuned so that processing the 279,524 entries currently takes 4 or so hours on a reasonable machine.
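    In skeletal form, the combination looks something like this (only a sketch; the real thing is tuned well beyond it):

        use strict;
        use warnings;
        use Text::Soundex;
        use Text::Levenshtein qw(distance);

        # Compare two names on their per-word Soundex codes rather than the
        # raw spelling, so simple misspellings collapse before the
        # edit-distance step.
        sub soundex_distance {
            my ($x, $y) = @_;
            my $xc = join ' ', map { soundex($_) // '' } split /\s+/, $x;
            my $yc = join ' ', map { soundex($_) // '' } split /\s+/, $y;
            return distance($xc, $yc);
        }

        print soundex_distance('Gary Crosby', 'Garry Crosbie'), "\n";  # 0 - same codes
        print soundex_distance('Paul McCartney', 'Wings'), "\n";       # large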

    You have to match the processing you are doing to the types of mistakes in your data. For names this will be spelling mistakes (so when matching names you must preprocess them with something like Soundex). For titles I would guess that, like my song titles, the words tend to be spelt correctly but you get them in different orders; you get issues like:

    • "Que sera sera (Whatever will be will be)" v "Whatever will be will be (Que sera sera)"
    • "Do What U Gotta Do" v "Do What You've Got to Do"
    • "Maybe It's Because" v "Maybe its because"
    • "Aint it Funny" v "Ain't it funny (Pettibone Remix)"
    • "Jazz-A-Samba... Part 1" v "Jazz A Samba"
    • "Would'ja Mind?" v "Would you Mind"

    So you want to process the names to spot the type of transcription error you have got. For example, I take the distinctive words, independent of order, and use them as a first-pass check; if you can quickly spot items that don't match, then the tiny minority that could match can have their distances calculated with a more comprehensive algorithm. My guess would be that, as with my data, the matches are few and far between.
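    In rough Perl terms, the first-pass check amounts to something like this (the word normalisation and the 80% overlap threshold are things you would tune to your own data):

        use strict;
        use warnings;

        # Reduce each title to its set of "distinctive" words, independent
        # of order; only pairs that share most of those words go on to the
        # more comprehensive (and expensive) distance calculation.
        sub word_set {
            my ($title) = @_;
            my %words = map  { $_ => 1 }
                        grep { length > 3 }      # crude stand-in for "distinctive"
                        split /\W+/, lc $title;
            return \%words;
        }

        sub shares_most_words {
            my ($a_ref, $b_ref) = @_;
            my $shared  = grep { $b_ref->{$_} } keys %$a_ref;
            my $smaller = keys(%$a_ref) < keys(%$b_ref) ? keys %$a_ref : keys %$b_ref;
            return $smaller && $shared / $smaller >= 0.8;
        }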

    Finally, in my case, there are always things that just don't match and have to be processed by hand. Don't be surprised to find that, for some of the dumb things people do, you just have to fix them manually. We have a continual effort to detect matches and fix them in our source data.

Re: Cluster a big bunch of strings
by Bloodnok (Vicar) on Mar 30, 2009 at 17:36 UTC
    Is Text::Soundex of no use ??

    A user level that continues to overstate my experience :-))
Re: Cluster a big bunch of strings
by erix (Prior) on Mar 30, 2009 at 18:15 UTC

    As another easy non-Perl solution, you might consider loading them into Postgres and using its native full-text search and indexing possibilities.

    Indexed searches are fast.

Re: Cluster a big bunch of strings
by dwhite20899 (Friar) on Apr 01, 2009 at 15:18 UTC
    citromatik,

    I do something similar with pseudo-random hash strings, and I collapsed them into something like clusters by using just the unique characters; see node_id=735177. ikegami had some good comments there, too.
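    i.e. something along these lines (a bare sketch of the unique-characters idea):

        use strict;
        use warnings;

        # Collapse each string to its sorted set of unique characters and
        # use that as a (very coarse) cluster key.
        sub char_signature {
            my ($string) = @_;
            my %seen;
            return join '', sort grep { !$seen{$_}++ } split //, lc $string;
        }

        my %cluster;
        while (my $line = <STDIN>) {
            chomp $line;
            push @{ $cluster{ char_signature($line) } }, $line;
        }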

    Probably not your cup of tea. Now I'm going to go read more about that vector space engine...