In reply to: text mining text tokenizing
You may want to take a look at Ted Pedersen's Ngram Statistics Package (NSP) and SenseClusters.
HTH,