Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Hello Monks,

Does anyone have code to cluster files (using k-means or any other algorithm) using vector notation? Each document is represented as a binary vector, where 1 means the word is present and 0 means the word is absent.

E.g., let the complete word list of a set of documents be: art, brick, ball, monk, pearl, road.
Contents of document d1 are: pearl, brick.

So, d1's vector is [0 1 0 0 1 0].
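
A minimal sketch of that encoding, in Python for illustration. The function name `to_binary_vector` is made up for this example. It assumes each document has already been split into words and that the word list keeps the order given above.

```python
def to_binary_vector(vocabulary, words):
    """Return a 0/1 list: 1 where the vocabulary word occurs in the document."""
    present = set(words)  # set membership makes each lookup cheap
    return [1 if term in present else 0 for term in vocabulary]


# Word list and document taken from the example above.
vocabulary = ["art", "brick", "ball", "monk", "pearl", "road"]
d1 = to_binary_vector(vocabulary, ["pearl", "brick"])
print(d1)  # [0, 1, 0, 0, 1, 0]
```

Words in a document that are not in the word list are simply ignored by this sketch.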

Thank You

Re: Clustering documents using vectors.
by BrowserUk (Patriarch) on Dec 14, 2012 at 13:15 UTC

    There isn't enough information here to construct a clustering algorithm.

    For example, let's say you have three documents using your complete word list:

    1. "art brick road"
    2. "art brick monk"
    3. "art ball road"

    There is no rational way to decide that any two of these documents are more closely related than any other pairing, hence there is no basis upon which to cluster them.

    You might try to introduce some artificial criterion by which they can be compared, e.g. decide that shared words "lower" (say, leftmost) in your vector are more important than shared words "higher" in the vector. But such a comparison would be totally arbitrary, taking no account of the semantic meaning of either the words or the word pairings.

    You might decide that a word pairing shared by one pair of documents is "more important" than another pairing because (say) the paired words sit closer together within the documents. But now you have three metrics to combine: the word pair, and the distance between its words in each of the two documents. How do you combine those three values into a single number that can be compared?

    And you need numbers before you can perform k-means or any other clustering algorithm.


    With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.

    RIP Neil Armstrong

Re: Clustering documents using vectors.
by chromatic (Archbishop) on Dec 14, 2012 at 18:51 UTC