in reply to text mining text tokenizing

You may want to take a look at Ted Pedersen's Ngram Statistics Package (NSP) and SenseClusters.

HTH,

planetscape