There isn't enough information here to construct a clustering algorithm.
For example, let's say you have three documents using your complete word list:
There is no rational way to decide that any two of these documents are more closely related than any other pairing; hence there is no basis upon which to cluster them.
You might try to introduce some artificial criterion by which they can be compared: e.g., decide that shared words "lower" (say, leftmost) in your vector are more important than shared words "higher" in the vector. But such a comparison would be totally arbitrary, taking no account of the semantic meaning of either the words or the word pairings.
You might decide that a word pairing shared by one pair of documents is "more important" than another pairing because (say) the paired words sit closer together within the documents. But now you have three metrics to combine: the word pair itself, and the distance between its words in each of the two documents. How do you combine those three values into a single number that can be compared?
And you need numbers before you can perform K-means or any other clustering algorithm.
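To make the tie concrete, here is a minimal Python sketch. The three word-presence vectors and the `shared_words` helper are invented for illustration; they are not the example from the post, just one simple case in which every pair of documents overlaps to exactly the same degree.

```python
from itertools import combinations

# Hypothetical word-presence vectors over a 3-word vocabulary
# (1 = the word occurs in the document). Invented for illustration.
docs = {
    "doc1": (1, 1, 0),
    "doc2": (0, 1, 1),
    "doc3": (1, 0, 1),
}

def shared_words(a, b):
    """Count vocabulary positions where both documents contain the word."""
    return sum(x & y for x, y in zip(a, b))

# Overlap for every pair of documents.
overlaps = {
    (m, n): shared_words(docs[m], docs[n])
    for m, n in combinations(docs, 2)
}

for pair, count in overlaps.items():
    print(pair, count)
```

Every pair shares the same number of words, so a similarity measure built on presence alone cannot rank one pairing above another.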
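The arbitrariness of combining those values can be sketched in a few lines of Python. Everything here is an assumption made up for the illustration: the `importance` figures, the gap sizes, the linear `score` formula and its weights. The point is only that the "closest" pair of documents depends entirely on weights that nothing in the data justifies.

```python
# Hypothetical measurements for one shared word pairing per document pair:
# (importance of the pairing, word gap in first doc, word gap in second doc).
# All numbers are invented for illustration.
measurements = {
    ("A", "B"): (3, 10, 12),
    ("A", "C"): (1, 2, 1),
}

def score(metrics, w_importance, w_gap):
    """One arbitrary way to fold three values into a single number."""
    importance, gap1, gap2 = metrics
    return w_importance * importance - w_gap * (gap1 + gap2)

def closest(w_importance, w_gap):
    """Return the document pair with the highest combined score."""
    return max(
        measurements,
        key=lambda pair: score(measurements[pair], w_importance, w_gap),
    )

# Ignoring proximity favours one pair; weighting it favours the other.
print(closest(1.0, 0.0))
print(closest(1.0, 1.0))
```

With the proximity weight at zero the (A, B) pair scores highest; with it at one, (A, C) does. Neither choice of weights is more defensible than the other, which is exactly the problem.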
In reply to Re: Clustering documents using vectors.
by BrowserUk
in thread Clustering documents using vectors.
by Anonymous Monk