If you don't want to read the article, I'll sum it up. Paul says that you should not try to build 'filters' for spam at all. A filter will always be vulnerable because Spammers are continually finding ways to defeat these filters. His solution seems to be pretty elegant. Look at the individual words in spam messages and 'real' messages and the look at your incoming message. Words in your incoming message are looked at individually and given a score based on whether it is a spam word or a good word. Total up the score for the incoming message and you have a very good filter that is self-correcting (it 'learns' more as the corpus of good and bad messages grows).
Mr. Graham goes on to show a few lines of Lisp:
that do the calculations (indecipherable to me, I grew up in Perl) and has a link to a page describing the underlying logic of Bayesian probabilities. http://www.mathpages.com/home/kmath267.htm.(let ((g (* 2 (or (gethash word good) 0))) (b (or (gethash word bad) 0))) (unless (< (+ g b) 5) (max .01 (min .99 (float (/ (min 1 (/ b nbad)) (+ (min 1 (/ g ngood)) (min 1 (/ b nbad)))))))) and then . . . (let ((prod (apply #'* probs))) (/ prod (+ prod (apply #'* (mapcar #'(lambda (x) (- 1 x)) probs)))))
I immediately dived into the page to figure out the 'magic formula' that I can start building a spam filter around . . . and basically sat drooling at the screen for 20 minutes trying to grok what was being talked about. Then I hopped onto CPAN, but could not find a module that does Bayesian probability calculations.
I'm looking for opinions about whether this would be a valuable addition to, perhaps, SpamAssassin (I haven't read the docs, but I believe that it can be inherited from and can accept additional filter types) and whether you know of a module that has already addressed Bayesian probabilities in Perl. If not, well, that's a good module for me to write, isn't it?
In reply to Bayesian Filtering for Spam by oakbox
| For: | Use: | ||
| & | & | ||
| < | < | ||
| > | > | ||
| [ | [ | ||
| ] | ] |