
I read, with great interest, Paul Graham's article on filtering spam with a Bayesian scoring system based on the individual words found in spam vs. 'good' email: http://www.paulgraham.com/spam.html This seems, at first blush, to be something Perl would excel at.

If you don't want to read the article, I'll sum it up. Paul says that you should not try to build 'filters' that target specific features of spam at all. Such a filter will always be vulnerable because spammers are continually finding ways to defeat it. His solution seems pretty elegant. Count the individual words in a collection of spam messages and a collection of 'real' messages, and then look at your incoming message. Each word in the incoming message is given a score based on how often it shows up in spam versus good mail. Combine the scores for the incoming message and you have a very good filter that is self-correcting (it 'learns' more as the corpus of good and bad messages grows).
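
As a starting point, the counting step might be sketched in Perl like this. The `count_tokens` name, the deliberately naive tokenizer and the sample messages are all my own assumptions for illustration, not something taken from the article:

```perl
use strict;
use warnings;

# Count how often each token appears across a list of messages.
# The token pattern is a naive guess (letters, digits, apostrophe,
# dollar sign, dash), lowercased so 'Free' and 'free' count together.
sub count_tokens {
    my (@messages) = @_;
    my %count;
    for my $msg (@messages) {
        $count{lc $1}++ while $msg =~ /([A-Za-z0-9'\$-]+)/g;
    }
    return \%count;
}

# Hypothetical corpora: one hash of counts per kind of mail.
my $good = count_tokens('lunch on friday?', 'perl module release notes');
my $bad  = count_tokens('free money click here', 'click here for free pills');

print "click appears $bad->{click} times in the spam sample\n";
```

The two hashes, plus the number of messages in each corpus, are the inputs the per-word scoring needs.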

Mr. Graham goes on to show a few lines of Lisp:

(let ((g (* 2 (or (gethash word good) 0)))
      (b (or (gethash word bad) 0)))
  (unless (< (+ g b) 5)
    (max .01
         (min .99 (float (/ (min 1 (/ b nbad))
                            (+ (min 1 (/ g ngood))
                               (min 1 (/ b nbad)))))))))

and then . . .

(let ((prod (apply #'* probs)))
  (/ prod
     (+ prod
        (apply #'*
               (mapcar #'(lambda (x) (- 1 x)) probs)))))

These snippets do the calculations (indecipherable to me, I grew up in Perl). He also links to a page describing the underlying logic of Bayesian probabilities: http://www.mathpages.com/home/kmath267.htm
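
As far as I can puzzle it out, a rough Perl reading of the two Lisp snippets would be something like the sketch below. The function names `word_prob` and `combined_prob` and the sample counts are hypothetical, and I am assuming `good` and `bad` are hashes of word counts, that `ngood` and `nbad` are the number of messages in each corpus, and that both corpora are non-empty (otherwise the divisions blow up). It needs Perl 5.10 or later for the `//` operator.

```perl
use strict;
use warnings;
use List::Util qw(min max);

# First snippet: probability that a message containing $word is spam.
# Returns nothing (undef in scalar context) when the word is too rare.
sub word_prob {
    my ($word, $good, $bad, $ngood, $nbad) = @_;
    my $g_count = 2 * ($good->{$word} // 0);   # good counts are doubled
    my $b_count = $bad->{$word} // 0;
    return if $g_count + $b_count < 5;
    my $p_bad  = min(1, $b_count / $nbad);
    my $p_good = min(1, $g_count / $ngood);
    # Clamp to [0.01, 0.99] so no single word is ever decisive.
    return max(0.01, min(0.99, $p_bad / ($p_good + $p_bad)));
}

# Second snippet: combine the per-word probabilities into one score.
sub combined_prob {
    my (@probs) = @_;
    my ($prod, $inv) = (1, 1);
    for my $p (@probs) {
        $prod *= $p;        # product of p
        $inv  *= 1 - $p;    # product of (1 - p)
    }
    return $prod / ($prod + $inv);
}

# Made-up counts: 10 good and 10 bad messages.
my %good = (hello  => 3);
my %bad  = (viagra => 10);
printf "%.2f\n", word_prob('viagra', \%good, \%bad, 10, 10);   # prints 0.99
printf "%.2f\n", word_prob('hello',  \%good, \%bad, 10, 10);   # prints 0.01
```

If I have read it right, a word that only ever shows up in spam scores 0.99, a word that only shows up in good mail scores 0.01, and a handful of equally neutral words (all 0.5) combine to 0.5.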

I immediately dived into the page to figure out the 'magic formula' that I could start building a spam filter around . . . and basically sat drooling at the screen for 20 minutes trying to grok what was being talked about. Then I hopped onto CPAN, but could not find a module that does Bayesian probability calculations.

I'm looking for opinions about whether this would be a valuable addition to, perhaps, SpamAssassin (I haven't read the docs, but I believe that it can be inherited from and can accept additional filter types) and whether you know of a module that has already addressed Bayesian probabilities in Perl. If not, well, that's a good module for me to write, isn't it?

oakbox

In reply to Bayesian Filtering for Spam by oakbox
