Re: From a SpamAssassin developer

I grant you that Paul Graham's traffic is easier to tell apart from spam than a marketing person's. I also submit that the system works much better for a single person with focussed interests than for multiple people with rather different interests. (Particularly when people disagree on what counts as spam. I consider chain mail spam. I know people who do not, and who occasionally send the junk to me...)

However I would suggest looking very closely at his approach rather than just saying, "He is doing Bayesian filtering, we do Bayesian filtering, it worked better for him than for us, so it must just be his data set." The fact is that he has tuned the numbers of his approach quite a bit, and some of that tuning is "wrong" from a strict Bayesian approach, but is probably very "right" from a spam elimination point of view.

In particular, if a word has appeared only in the spam body of email or only in the non-spam body, the probability that he assigns to it is .99 or .01 respectively. That means that if he repeatedly gets spams for the same products (which most people do), references to those products almost immediately become labelled as spam. Conversely, approving a single email from a person goes a long way towards labelling any email from that company or person, or about that topic (based on subject keywords), as non-spam.
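To make the effect of that clamping concrete, here is a minimal sketch in Python. The function name, its arguments, and the use of per-corpus frequencies are my own assumptions for illustration; only the .99 and .01 limits come from the description above, and this is not a transcription of Paul Graham's code.

```python
def word_spam_probability(spam_count, ham_count, n_spam, n_ham):
    """Clamped spam probability for one word (hypothetical helper).

    spam_count / ham_count: messages of each kind containing the word.
    n_spam / n_ham: total messages of each kind in the training corpus.
    """
    spam_freq = min(1.0, spam_count / n_spam)
    ham_freq = min(1.0, ham_count / n_ham)
    if spam_freq + ham_freq == 0:
        return None  # word never seen, so there is no evidence either way
    p = spam_freq / (spam_freq + ham_freq)
    # A word seen in only one corpus would score exactly 1.0 or 0.0;
    # the clamp turns that into .99 or .01.
    return max(0.01, min(0.99, p))
```

With this sketch a word seen in a handful of spams and never in legitimate mail already scores .99, which is the "almost immediately" behaviour described above.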

A Bayesian approach to deciding how strong a piece of evidence a given word is that something is spam would involve assigning a prior distribution and then modifying that upon observation. This would take several more observations to learn what words you do or do not like than Paul Graham's very rapid categorization process does. He then compounds this by artificially limiting his analysis to the 15 most distinctive words that he saw, which means that he is heavily biased towards making a decision based on rapid categorizations from a small section of the sample set.
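The contrast can be sketched roughly as follows. Everything here is an assumption made for illustration: the Beta prior and its parameters, the function names, and the product rule used to combine per-word probabilities (the usual naive-Bayes form, which treats words as independent). Only the limit of 15 words comes from the text above.

```python
from math import prod

def smoothed_probability(spam_count, ham_count, prior_spam=1.0, prior_ham=1.0):
    """Posterior mean under an assumed Beta(prior_spam, prior_ham) prior."""
    total = spam_count + ham_count + prior_spam + prior_ham
    return (spam_count + prior_spam) / total

def most_distinctive(probs, limit=15):
    """Keep the `limit` probabilities farthest from the neutral value 0.5."""
    return sorted(probs, key=lambda p: abs(p - 0.5), reverse=True)[:limit]

def combine(probs):
    """Naive-Bayes style combination of per-word spam probabilities."""
    spam = prod(probs)
    ham = prod(1.0 - p for p in probs)
    return spam / (spam + ham)
```

Under these assumptions a word seen once, in spam only, gets a smoothed probability of 2/3 rather than .99, so it takes several more sightings before it dominates. Feeding only the output of `most_distinctive` into `combine` means a few clamped words can settle the verdict on their own.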

In other words, Paul's algorithm likely works very well, but not necessarily for the theoretical reasons that he thinks apply.

Re: Re: From a SpamAssassin developer by Matts (Deacon) on Aug 19, 2002 at 07:00 UTC
Well, I've now written what I think is basically what Paul has written in his lisp code (including stuff like discarding all but the most interesting 15 features) and tested it. The results are (unsurprisingly to me) not as accurate as Paul describes on mixed types of messages. The most important thing to remember about doing anything with probabilities is to not mix up your training and validation data sets. I get the feeling that Paul isn't keeping them separate when calculating his statistics. I get zero false positives too when I validate against the training data set. However, on the plus side, the amount of data stored by his system is significantly smaller than that stored by the pure Bayesian one used in AI::Categorize. So I'll probably switch over to using this one instead. I'll post some of the code to the SpamAssassin list later today probably, in case someone wants to play with it some more.
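For anyone wanting to reproduce that kind of check, keeping the two data sets apart can be sketched like this in Python. The function names, the 20% holdout fraction, and the `(text, is_spam)` tuple layout are assumptions for illustration, not part of anyone's posted code.

```python
import random

def split_corpus(messages, holdout_fraction=0.2, seed=0):
    """Shuffle labelled messages and split them into (training, validation)."""
    shuffled = list(messages)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is repeatable
    n_holdout = round(len(shuffled) * holdout_fraction)
    cut = len(shuffled) - n_holdout
    return shuffled[:cut], shuffled[cut:]

def false_positive_rate(classify, validation):
    """Fraction of legitimate messages that `classify` labels as spam.

    `validation` is a list of (text, is_spam) pairs; `classify` is any
    callable returning True for spam.
    """
    ham = [text for text, is_spam in validation if not is_spam]
    if not ham:
        return 0.0
    return sum(1 for text in ham if classify(text)) / len(ham)
```

The filter would be trained on the first list only and scored on the second; scoring it on the training list is what produces the misleading zero false positives.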
Re: Re: Re: From a SpamAssassin developer by Elian (Parson) on Aug 19, 2002 at 07:11 UTC
Don't be too surprised that Paul's solution's not a good general-purpose one. His data set's probably quite small, with good locality, and odds are he made sure to skew his results to his data. It's not that his methods are bad for his needs, just that his needs are rather different than most people's.
Re: Re: Re: Re: From a SpamAssassin developer by Matts (Deacon) on Aug 19, 2002 at 16:24 UTC
I'm not surprised. Not even slightly - see my original post. The biggest thing about statistical analysis is you simply cannot test it on the training data set. I get 100% accuracy when I do that. And it's not surprising. I'm speculating that's what PG did. But I could be wrong. There is also the fact that the training often overfits. None of this is news to anyone versed in machine learning (which I'm starting to be ;-) Matt.
Re: Re: Re: From a SpamAssassin developer by Anonymous Monk on Aug 20, 2002 at 01:23 UTC
What I would wonder is not whether his method works as well in a general-purpose environment as it does for him. I didn't expect it to. It is rather whether it works well enough to be useful, and how it performs relative to the pure Bayesian one that you already had.