in reply to Using Word Tokens as Features

First normalize your input, e.g. s/\W+/ /g, which collapses each run of non-word characters (punctuation, whitespace) into a single space.

Then count the words in each message using a hash, and add those counts together for your whole corpus. From there, you should be able to calculate TF/IDF scores, which sounds like a homework problem.
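A rough sketch of those steps, in case it helps. The sample messages, the names tokenize, tf_idf, @doc_counts, %df and %total, the lowercasing step, and the plain tf * log(N / df) weighting are all my own assumptions; there are several common TF-IDF variants, so adjust to whatever your assignment or application expects.

```perl
use strict;
use warnings;

# Hypothetical sample corpus: one string per message.
my @messages = (
    'The cat sat on the mat.',
    'The dog chased the cat!',
    'Dogs and cats: friends?',
);

# Normalize: collapse non-word runs to a space, lowercase, split on whitespace.
sub tokenize {
    my ($text) = @_;
    $text =~ s/\W+/ /g;
    return split ' ', lc $text;
}

my @doc_counts;   # one hash of term => count per message
my %df;           # term => number of messages containing it
my %total;        # term => count across the whole corpus

for my $msg (@messages) {
    my %count;
    $count{$_}++ for tokenize($msg);
    push @doc_counts, \%count;
    for my $term (keys %count) {
        $df{$term}++;
        $total{$term} += $count{$term};
    }
}

# One common weighting (an assumption here): tf * log(N / df).
sub tf_idf {
    my ($term, $doc_index) = @_;
    my $tf = $doc_counts[$doc_index]{$term} // 0;
    return 0 unless $tf && $df{$term};
    return $tf * log(scalar(@messages) / $df{$term});
}

# Print message index, term, and score for every term in every message.
for my $i (0 .. $#doc_counts) {
    for my $term (sort keys %{ $doc_counts[$i] }) {
        printf "%d\t%s\t%.3f\n", $i, $term, tf_idf($term, $i);
    }
}
```

Note that \W+ also splits on apostrophes, so "don't" becomes the two tokens "don" and "t"; whether that matters depends on your data.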

Re^2: Using Word Tokens as Features
by MidLifeXis (Monsignor) on Apr 12, 2013 at 15:01 UTC

    ... and perhaps converting upper case to lower case...

    --MidLifeXis