From a SpamAssassin developer

by Matts (Deacon)
on Aug 18, 2002 at 08:17 UTC ( [id://190965] : note )

in reply to Bayesian Filtering for Spam

Hi, I'm one of the SpamAssassin developers.

Yes, Bayesian filtering has been tested, and in fact works reasonably well, but it does not generalise to a product the way SpamAssassin's ruleset does: it always requires training on the user's corpus. SpamAssassin's main market is gateway scanning, and as such we can't just ship a Bayesian classifier and expect it to "just work" the way the current ruleset does. It has to be trained on one individual's type of email.

Also, I think Paul gets very different spam from what we see in the project. I've got a Bayesian classifier plugin for SpamAssassin; it's part of MessageLabs' proprietary extensions to SpamAssassin. The bonus of it is that we can tune it for our customers because we're an ISP. However, even given that tuning, we're not seeing anywhere near the accuracy that Paul is seeing, simply because our users have vastly different email corpuses from Paul's.

This system probably works great for Geeks though.

Another thing to note is that I believe this is the training system that Apple Jaguar's new mail client is using. It seems to be working reasonably well for me so far too.

Re: From a SpamAssassin developer
by Anonymous Monk on Aug 18, 2002 at 12:02 UTC
    I grant you that Paul Graham's traffic is easier to tell apart from spam than a marketing person's. I also submit that the system works much better for a single person with focussed interests than for multiple people with rather different interests. (Particularly when people disagree on what counts as spam. I consider chain mail spam. I know people who do not, and who occasionally send the junk to me...)

    However, I would suggest looking very closely at his approach rather than just saying, "He is doing Bayesian filtering, we do Bayesian filtering, it worked better for him than for us, so it must just be his data set." The fact is that he has tuned the numbers of his approach quite a bit, and some of that tuning is "wrong" from a strict Bayesian point of view, but is probably very "right" from a spam elimination point of view.

    In particular, if a word has appeared only in the spam corpus or only in the non-spam corpus, the probability that he assigns to it is .99 or .01 respectively. That means that if he repeatedly gets spams for the same products (which most people do), references to those products almost immediately become labelled as spam. Conversely, approving a single email from a person goes a long way towards labelling any email from that company or person, or about that topic (based on subject keywords), as non-spam.
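
    To make the clamping concrete, here is a minimal Python sketch of a per-word estimate of this kind. It is not Paul's Lisp code: the function name, the parameter names and the handling of never-seen words are assumptions made for illustration, and only the .99/.01 limits come from the description above.

```python
def word_spam_probability(spam_hits, ham_hits, n_spam, n_ham,
                          floor=0.01, ceiling=0.99):
    """Estimate P(spam | word) from raw counts, clamped to [floor, ceiling].

    spam_hits / ham_hits: messages in each corpus that contain the word.
    n_spam / n_ham: total number of messages in each corpus.
    (All names here are hypothetical, chosen for this sketch.)
    """
    spam_freq = min(1.0, spam_hits / n_spam)
    ham_freq = min(1.0, ham_hits / n_ham)
    if spam_freq + ham_freq == 0:
        return None  # word never seen; the caller must pick a default
    raw = spam_freq / (spam_freq + ham_freq)
    return max(floor, min(ceiling, raw))


# A word seen once, and only in spam, is already pinned at the ceiling.
print(word_spam_probability(1, 0, n_spam=1000, n_ham=1000))  # 0.99
# A word seen once, and only in non-spam mail, is pinned at the floor.
print(word_spam_probability(0, 1, n_spam=1000, n_ham=1000))  # 0.01
```

    The point of the example is that a single sighting is enough to push a word to one extreme, which is what makes the categorization so fast.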

    A Bayesian approach to deciding how strongly a given word indicates spam would involve assigning a prior distribution and then updating it upon observation. This would take several more observations to learn which words you do or do not like than Paul Graham's very rapid categorization process does. He then compounds this by artificially limiting his analysis to the 15 most distinctive words that he saw, which means that he is heavily biased towards making a decision based on rapid categorizations from a small section of the sample set.
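
    The contrast can be sketched in a few lines of Python. Everything here is illustrative: the uniform Beta(1, 1) prior, the assumption of equally sized corpora, and both function names are choices made for this example, not anything taken from Paul's code. Only the cut-off of 15 words comes from the discussion above.

```python
from math import prod


def smoothed_probability(spam_hits, ham_hits, prior_spam=1.0, prior_ham=1.0):
    """Posterior mean of P(spam | word) under a Beta(prior_spam, prior_ham) prior.

    Assumes the two corpora are the same size, so raw counts are comparable.
    """
    return (spam_hits + prior_spam) / (spam_hits + ham_hits + prior_spam + prior_ham)


def combine_most_interesting(word_probs, n=15):
    """Combine the n probabilities furthest from 0.5 into a single spam score.

    Expects probabilities strictly between 0 and 1 (for example, clamped
    to [0.01, 0.99]); a mix of exact 0s and 1s would divide by zero.
    """
    interesting = sorted(word_probs, key=lambda p: abs(p - 0.5), reverse=True)[:n]
    spam_side = prod(interesting)
    ham_side = prod(1.0 - p for p in interesting)
    return spam_side / (spam_side + ham_side)


# With a prior, one spam-only sighting moves the estimate only part of the way.
print(round(smoothed_probability(1, 0), 3))  # 0.667
print(round(smoothed_probability(5, 0), 3))  # 0.857

# Three extreme words decide the score; the 40 neutral ones barely matter.
probs = [0.99, 0.99, 0.01] + [0.5] * 40
print(round(combine_most_interesting(probs), 3))  # 0.99
```

    Under these assumptions the smoothed estimate needs 98 spam-only sightings to reach .99, whereas the clamped scheme gets there after one.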

    In other words, Paul's algorithm likely works very well, but not necessarily for the theoretical reasons that he thinks apply.

      Well, I've now written what I think is basically what Paul has written in his lisp code (including stuff like discarding all but the most interesting 15 features) and tested it.

      The results are (unsurprisingly to me) not as accurate as Paul describes on mixed types of messages.

      The most important thing to remember about doing anything with probabilities is to not mix up your training and validation data sets. I get the feeling that Paul isn't keeping the two separate when calculating his statistics. I get zero false positives too when I validate against the training data set.
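
      For anyone wanting to repeat the test, a held-out split can be sketched in Python as follows. The function names, the 20% holdout fraction and the (text, is_spam) tuple layout are all assumptions made for this example, not part of any code mentioned in this thread.

```python
import random


def split_corpus(messages, holdout_fraction=0.2, seed=0):
    """Shuffle labelled messages and split them into (training, validation)."""
    shuffled = list(messages)
    random.Random(seed).shuffle(shuffled)
    n_holdout = int(len(shuffled) * holdout_fraction)
    return shuffled[n_holdout:], shuffled[:n_holdout]


def false_positive_rate(classify, labelled_messages):
    """Fraction of non-spam messages that classify() flags as spam.

    labelled_messages is a list of (text, is_spam) pairs; classify is any
    callable that takes the text and returns True for spam.
    """
    ham = [text for text, is_spam in labelled_messages if not is_spam]
    if not ham:
        return 0.0
    return sum(1 for text in ham if classify(text)) / len(ham)


corpus = [(f"message {i}", i % 2 == 0) for i in range(10)]
training, validation = split_corpus(corpus)
print(len(training), len(validation))  # 8 2
```

      The classifier should be built from `training` only, and `false_positive_rate` should be reported on `validation`; measuring it on `training` is what produces the flattering zero.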

      However, on the plus side, the amount of data stored by his system compared to the pure Bayesian one used in AI::Categorize is significantly smaller. So I'll probably switch over to using this one instead.

      I'll post some of the code to the SpamAssassin list later today probably, in case someone wants to play with it some more.

        Don't be too surprised that Paul's solution's not a good general-purpose one. His data set's probably quite small, with good locality, and odds are he made sure to skew his results to his data. It's not that his methods are bad for his needs, just that his needs are rather different than most people's.
        What I would wonder is not whether his method works as well in a general-purpose environment as it does for him. I didn't expect it to. It is rather whether it works well enough to be useful, and how it performs relative to the pure Bayesian one that you already had.
Re: From a SpamAssassin developer
by blakem (Monsignor) on Aug 18, 2002 at 09:57 UTC
    probably works great for Geeks though
    Is that because we are more likely to constantly tweak the setup, or because the mail received by geeks is easier to filter using this method?


      Because, with almost 100% accuracy, any email with the word "perl" in it is not spam. And any email with certain spammy words in it is spam.

      Think about the kind of marketing crap a CEO might subscribe to though. Those are the sorts of things we have to deal with in SpamAssassin.

      However, now that the cat seems to be out of the bag with Bayesian filtering, we may as well provide some code so that people who run SpamAssassin themselves can do their own customisation and have the filter trained on their own mail.