in reply to From a SpamAssassin developer
in thread Bayesian Filtering for Spam
However, I would suggest looking very closely at his approach rather than just saying, "He is doing Bayesian filtering, we do Bayesian filtering, it worked better for him than for us, so it must just be his data set." The fact is that he has tuned the numbers in his approach quite a bit, and some of that tuning is "wrong" from a strict Bayesian point of view, but is probably very "right" from a spam elimination point of view.
In particular, if a word has appeared in only one of the two bodies of email, the probability that he assigns to it is .99 (seen only in spam) or .01 (seen only in non-spam). That means that if he repeatedly gets spams for the same products (which most people do), references to those products almost immediately become labelled as spam. Conversely, approving a single email from a person goes a long way towards labelling any email from that company or person, or about that topic (based on subject keywords), as non-spam.
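To make the clamping concrete, here is a minimal sketch in Python. It is not Paul Graham's actual code: the function name, the per-corpus frequency ratio, and the neutral 0.5 for a never-seen word are all assumptions made for illustration. Only the .99/.01 limits come from the description above.

```python
def clamped_word_prob(spam_count, ham_count, n_spam, n_ham, lo=0.01, hi=0.99):
    """Estimate P(spam | word) from corpus frequencies, clamped to [lo, hi].

    spam_count / ham_count: how often the word was seen in each corpus.
    n_spam / n_ham: number of messages in each corpus (assumed > 0).
    """
    spam_freq = min(1.0, spam_count / n_spam)
    ham_freq = min(1.0, ham_count / n_ham)
    if spam_freq + ham_freq == 0:
        return 0.5  # never seen anywhere: treated as neutral (an assumption)
    raw = spam_freq / (spam_freq + ham_freq)
    # A word seen in only one corpus has raw == 1.0 or 0.0, so a single
    # sighting is already pushed all the way to the clamp.
    return max(lo, min(hi, raw))


print(clamped_word_prob(1, 0, n_spam=100, n_ham=100))  # seen once, spam only
print(clamped_word_prob(0, 1, n_spam=100, n_ham=100))  # seen once, non-spam only
```

Under these assumptions a word seen exactly once, in one spam, gets the same .99 as a word seen in every spam, which is the rapid categorization described above.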
A Bayesian approach to deciding how strong a piece of evidence a given word is that something is spam would involve assigning a prior distribution and then updating it upon observation. This would take several more observations to learn which words you do or do not like than Paul Graham's very rapid categorization process does. He then compounds this by artificially limiting his analysis to the 15 most distinctive words that he saw, which means that he is heavily biased towards making a decision based on rapid categorizations from a small section of the sample set.
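The difference can be sketched in Python under some loudly stated assumptions: a Beta prior on each word's spam probability (uniform by default), the posterior mean as the per-word estimate, and a naive-Bayes style product to combine the selected words. The function names and the choice of prior are hypothetical and serve only to show the contrast. This is not a description of anyone's production filter.

```python
import math


def posterior_mean(spam_hits, ham_hits, prior_spam=1.0, prior_ham=1.0):
    """Posterior mean of P(spam | word) under a Beta(prior_spam, prior_ham) prior.

    The uniform Beta(1, 1) default is an assumption chosen for illustration.
    """
    return (prior_spam + spam_hits) / (prior_spam + prior_ham + spam_hits + ham_hits)


def combine_most_distinctive(probs, limit=15):
    """Combine only the `limit` probabilities furthest from the neutral 0.5.

    Assumes every probability lies strictly between 0 and 1, and treats the
    words as independent (the usual naive-Bayes simplification).
    """
    top = sorted(probs, key=lambda p: abs(p - 0.5), reverse=True)[:limit]
    spam_like = math.prod(top)
    ham_like = math.prod(1.0 - p for p in top)
    return spam_like / (spam_like + ham_like)


# Three words, each seen once and only in spam, among 100 neutral words.
neutral = [0.5] * 100
clamped = combine_most_distinctive([0.99] * 3 + neutral)
smoothed = combine_most_distinctive([posterior_mean(1, 0)] * 3 + neutral)
print(clamped, smoothed)
```

With the uniform prior, a single spam-only sighting moves a word to 2/3 rather than .99, so the same three words yield a combined score of 8/9 instead of something above .9999. The neutral words never influence either result, because the cutoff keeps only the most extreme probabilities.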
In other words, Paul's algorithm likely works very well, but not necessarily for the theoretical reasons that he thinks apply.