PerlMonks  

Bayesian Filtering for Spam

by oakbox (Chaplain)
on Aug 17, 2002 at 09:03 UTC ( [id://190837] )

I read, with great interest, Paul Graham's article on filtering for spam using a Bayesian scoring system of individual words found in spam vs. 'good' email. http://www.paulgraham.com/spam.html This seems, at first blush, to be something Perl would excel at.

If you don't want to read the article, I'll sum it up. Paul says that you should not try to build 'filters' for spam at all. A filter will always be vulnerable, because spammers are continually finding ways to defeat filters. His solution seems pretty elegant: look at the individual words in spam messages and in 'real' messages, then examine your incoming message. Each word in the incoming message is given a score based on whether it is a spam word or a good word. Total up the scores for the incoming message and you have a very good filter that is self-correcting (it 'learns' more as the corpus of good and bad messages grows).

Mr. Graham goes on to show a few lines of Lisp:

(let ((g (* 2 (or (gethash word good) 0)))
      (b (or (gethash word bad) 0)))
  (unless (< (+ g b) 5)
    (max .01
         (min .99
              (float (/ (min 1 (/ b nbad))
                        (+ (min 1 (/ g ngood))
                           (min 1 (/ b nbad)))))))))

and then . . .

(let ((prod (apply #'* probs)))
  (/ prod (+ prod (apply #'* (mapcar #'(lambda (x) (- 1 x)) probs)))))
that do the calculations (indecipherable to me; I grew up in Perl), and he has a link to a page describing the underlying logic of Bayesian probabilities: http://www.mathpages.com/home/kmath267.htm.
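For what it's worth, the second fragment just combines the per-word probabilities p1..pn into one score as p = (p1 * ... * pn) / (p1 * ... * pn + (1-p1) * ... * (1-pn)). A minimal Perl sketch of that rule, with made-up per-word probabilities (the numbers are mine, not Paul's):

```perl
# Combining rule from the second Lisp fragment, in plain Perl.
# The probabilities below are made-up numbers, just for illustration.
my @probs = (0.99, 0.90, 0.20);   # per-word spam probabilities

my ($spam, $ham) = (1, 1);
$spam *= $_       for @probs;     # p1 * p2 * ... * pn
$ham  *= (1 - $_) for @probs;     # (1-p1) * (1-p2) * ... * (1-pn)

my $score = $spam / ($spam + $ham);
printf "%.4f\n", $score;          # prints 0.9955
```

Note how a single strong spam word (0.99) dominates the combined score even when another word leans the other way.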

I immediately dived into that page to figure out the 'magic formula' I could start building a spam filter around . . . and basically sat drooling at the screen for 20 minutes trying to grok what was being talked about. Then I hopped onto CPAN, but could not find a module that does Bayesian probability calculations.

I'm looking for opinions about whether this would be a valuable addition to, perhaps, SpamAssassin (I haven't read the docs, but I believe that it can be inherited from and can accept additional filter types) and whether you know of a module that has already addressed Bayesian probabilities in Perl. If not, well, that's a good module for me to write, isn't it?

oakbox

Replies are listed 'Best First'.
Re: Bayesian Filtering for Spam
by bart (Canon) on Aug 17, 2002 at 21:00 UTC
    Mr. Graham goes on to show a few lines of Lisp:
    (let ((g (* 2 (or (gethash word good) 0)))
          (b (or (gethash word bad) 0)))
      (unless (< (+ g b) 5)
        (max .01
             (min .99
                  (float (/ (min 1 (/ b nbad))
                            (+ (min 1 (/ g ngood))
                               (min 1 (/ b nbad)))))))))

    and then . . .

    (let ((prod (apply #'* probs)))
      (/ prod (+ prod (apply #'* (mapcar #'(lambda (x) (- 1 x)) probs)))))
    that do the calculations (indecipherable to me, I grew up in Perl)(...)

    I immediately dived into the page to figure out the 'magic formula' that I can start building a spam filter around . . . and basically sat drooling at the screen for 20 minutes trying to grok what was being talked about.

    Let me try to help you on your way.

    Starting with the weirdest critters: "mapcar" is to Lisp what map() is to Perl. The first atom is quoted so it isn't evaluated immediately. "lambda" defines an anonymous function (here with one parameter, x). And "apply" in Lisp is what some Perlers may know as reduce(). There's been extensive talk about it in the Perl 6 RFCs, and the library List::Util implements it for people who'd like to use it with current-day perls. Heh: the same library implements min() and max() as well. Good. So let's use that.

    Here's an attempt at a literal conversion into Perl:

    use List::Util qw(min max reduce);

    sub score {
        my ($word) = @_;
        # uses global %good, %bad, $ngood, $nbad
        my $g = 2 * ($good{$word} || 0);
        my $b = $bad{$word} || 0;
        unless ($g + $b < 5) {
            return max(0.01, min(0.99,
                min(1, $b / $nbad) /
                (min(1, $g / $ngood) + min(1, $b / $nbad))));
        }
        # otherwise: return undef
    }

    sub prob {
        my @probs = @_;
        my $prod = reduce { $a * $b } @probs;
        return $prod / ($prod + reduce { $a * $b } map { 1 - $_ } @probs);
    }
    There. That wasn't so hard, was it? But it's just a starting point. You won't filter any spam with it just like that, just yet.
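    A quick smoke test of the translation, with a toy corpus (the counts are made-up numbers): a word seen only in spam pins at 0.99, and a word seen only in good mail pins at 0.01.

```perl
use List::Util qw(min max);

# Toy corpus (made-up counts) to exercise the score() translation above.
our %good = ( perl   => 10 );    # seen 10 times in good mail
our %bad  = ( viagra => 8 );     # seen 8 times in spam
our ($ngood, $nbad) = (20, 10);  # message counts

sub score {                      # same formula as the translation above
    my ($word) = @_;
    my $g = 2 * ($good{$word} || 0);
    my $b = $bad{$word} || 0;
    return undef if $g + $b < 5;
    return max(0.01, min(0.99,
        min(1, $b / $nbad) /
        (min(1, $g / $ngood) + min(1, $b / $nbad))));
}

print score('viagra'), "\n";   # 0.99 - only ever seen in spam
print score('perl'), "\n";     # 0.01 - only ever seen in good mail
```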
      You won't filter any spam with it just like that, just yet.

      But it's a GREAT starting place; along with the more nuts-and-bolts explanation by elusion, I don't feel so lost with that code. Thanks to all of you for the great pointers.

      oakbox

Re: Bayesian Filtering for Spam
by elusion (Curate) on Aug 17, 2002 at 16:42 UTC
    I'm posting this here because it's more (I hope) than a mere translation of the code. But we'll start with that:
    sub min {    # returns the smallest argument given
        my $min = shift;
        for (@_) { $min = $_ if $_ < $min }
        return $min;
    }

    sub max {    # returns the largest argument given
        my $max = shift;
        for (@_) { $max = $_ if $_ > $max }
        return $max;
    }

    my %good;     # number of occurrences of each word in normal email
    my %bad;      # number of occurrences of each word in spam
    my $ngood;    # number of good messages
    my $nbad;     # number of bad messages

    sub score {    # returns the spam score for the words given
        my @words = @_;    # words in the email
        my @probs;         # probabilities for each word
        for my $word (@words) {
            my $g = 2 * ($good{$word} || 0);
            my $b = $bad{$word} || 0;
            unless ($g + $b < 5) {
                push @probs, max(0.01, min(0.99,
                    min(1, $b / $nbad) /
                    (min(1, $g / $ngood) + min(1, $b / $nbad))));
            }
        }
        my $prod = 1;
        $prod *= $_ for @probs;
        my $prod2 = 1;
        $prod2 *= $_ for map { 1 - $_ } @probs;
        return $prod / ($prod + $prod2);
    }
    After splitting the message up into individual words, you call score() with those words to retrieve the probability that the message is spam. Or at least I believe that's how it works. Oh, and by the way: the given code is in Arc, a new dialect of Lisp that's being developed.
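    One simple way to get that word list is a quick split. This tokenizer is my own assumption; Graham's article describes his own tokenizing rules:

```perl
# Naive tokenizer (an assumption, not Graham's exact rules): split on
# anything that isn't a word-ish character and lowercase everything.
sub tokenize {
    my ($text) = @_;
    return grep { length } split /[^A-Za-z0-9'\$-]+/, lc $text;
}

my @words = tokenize("Click HERE for 25 MILLION emails!");
print "@words\n";   # click here for 25 million emails
```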

    elusion : http://matt.diephouse.com

Re: Bayesian Filtering for Spam
by dws (Chancellor) on Aug 17, 2002 at 16:20 UTC
    I read, with great interest, Paul Graham's article on filtering for spam using a Bayesian scoring system of individual words found in spam vs. 'good' email.

    The translation of Graham's LISP fragments is the subject of LISP translation help ??.

    I did an implementation of Graham's algorithm yesterday, and will post it once I've cleaned it up and tuned it. It seems to work very well.

From a SpamAssassin developer
by Matts (Deacon) on Aug 18, 2002 at 08:17 UTC
    Hi, I'm one of the SpamAssassin developers.

    Yes, Bayesian filtering has been tested, and in fact works reasonably well, but it does not generalise to a product the way SpamAssassin does - it always requires training against the user's corpus. SpamAssassin's main market is gateway scanning, and as such we can't just ship a Bayesian classifier and expect it to "just work" the way the current ruleset does. It has to be trained on one individual's type of email.

    Also, I think Paul gets very different spam to what we see in the project. I've got a Bayesian classifier plugin for SpamAssassin - it's part of MessageLabs' proprietary extensions to SpamAssassin. The bonus is that we can tune it for our customers because we're an ISP. However, even given that tuning, we're not seeing anywhere near the accuracy that Paul is seeing, simply because our users have vastly different email corpora to Paul's.

    This system probably works great for Geeks though.

    Another thing to note is that I believe this is the training system that Apple Jaguar's new Mail.app is using. It seems to be working reasonably well for me so far too.

      I grant you that Paul Graham's traffic is easier to tell apart from spam than a marketing person's. I also submit that the system works much better for a single person with focussed interests than for multiple people with rather different interests. (Particularly when people disagree on spam. I consider chain mail spam. I know people who do not who send the junk to me occasionally...)

      However, I would suggest looking very closely at his approach rather than just saying, "He is doing Bayesian filtering, we do Bayesian filtering, it worked better for him than for us, so it must just be his data set." The fact is that he has tuned the numbers of his approach quite a bit, and some of that tuning is "wrong" from a strict Bayesian approach, but is probably very "right" from a spam elimination point of view.

      In particular if a word has only appeared in one or the other body of email, the probability that he assigns to it is .99 or .01 respectively. That means that if he repeatedly gets spams for the same products (which most people do), references to those products almost immediately become labelled as spam. Conversely approving a single email from a person goes a long way towards labelling any email from that company, person, or about that topic (based on subject keywords) as non-spam.

      A Bayesian approach to deciding how strong a piece of evidence a given word is that something is spam would involve assigning a prior distribution and then modifying it upon observation. This would take several more observations to learn which words you do or do not like than Paul Graham's very rapid categorization process does. He then compounds this by artificially limiting his analysis to the 15 most distinctive words that he saw, which means that he is heavily biased towards making a decision based on rapid categorizations from a small section of the sample set.
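      For the curious, the more orthodox per-word estimate described here might smooth each word's counts with a prior, something like (b + s*x) / (g + b + s), where s is the prior's strength and x its mean. The names and numbers below are my own illustration of the idea, not tilly's code:

```perl
# Per-word spam probability smoothed by a prior: a sketch of the
# 'assign a prior, then update on observation' idea. The parameter
# names s (prior strength) and x (prior mean) are my own.
sub smoothed {
    my ($g, $b, $s, $x) = @_;
    return ($b + $s * $x) / ($g + $b + $s);
}

printf "%.2f\n", smoothed(0, 1, 1, 0.5);    # one spam sighting: 0.75, not 0.99
printf "%.2f\n", smoothed(0, 10, 1, 0.5);   # ten sightings: 0.95, converging
```

      Unlike Graham's 0.99/0.01 shortcut, a single observation moves the estimate only partway from the prior, which is exactly why it takes more observations to commit.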

      In other words, Paul's algorithm likely works very well, but not necessarily for the theoretical reasons that he thinks apply.

        Well, I've now written what I think is basically what Paul has written in his lisp code (including stuff like discarding all but the most interesting 15 features) and tested it.

        The results are (unsurprisingly to me) not as accurate as Paul describes on mixed types of messages.

        The most important thing to remember about doing anything with probabilities is to not mix up your training and validation data sets. I get the feeling that Paul isn't doing that in calculating his statistics. I get zero false positives too when I validate against the training data set.
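        Keeping the two sets apart is easy to get right mechanically; a sketch (the elements here are stand-ins for message filenames):

```perl
use List::Util qw(shuffle);

# Hold out 20% of the corpus for validation; train word counts on the rest.
my @messages = shuffle(1 .. 100);           # stand-ins for message filenames
my $cut      = int(@messages * 0.8);
my @train    = @messages[0 .. $cut - 1];
my @validate = @messages[$cut .. $#messages];

# Build %good/%bad counts from @train only; report error rates on @validate.
printf "train=%d validate=%d\n", scalar @train, scalar @validate;   # train=80 validate=20
```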

        However, on the plus side, the amount of data stored by his system compared to the pure Bayesian one used in AI::Categorize is significantly smaller. So I'll probably switch over to using this one instead.

        I'll post some of the code to the SpamAssassin list later today probably, in case someone wants to play with it some more.

      probably works great for Geeks though
      Is that because we are more likely to constantly tweak the setup, or because the mail received by geeks is easier to filter using this method?

      -Blake

        Because, with almost 100% accuracy, any email with the word "perl" in it is not spam. And any email with certain spammy words in it is spam.

        Think about the kind of marketing crap a CEO might subscribe to though. That's the sorts of things we have to deal with in SpamAssassin.

        However, now that the cat seems to be out of the bag with Bayesian filtering, we may as well provide some code to let people who run it themselves do their own customisation and let it run for them.

Re: Bayesian Filtering for Spam
by paisley (Novice) on Aug 17, 2002 at 16:40 UTC
    I don't know how much of Bayesian statistics is Naive Bayesian statistics, but there is AI::Categorize::NaiveBayes, as discussed here...

    Also, from my quick scan of messages on the SpamAssassin list this morning, it looks like that application has an algorithm not terribly unlike the one suggested by Mr. Graham. I don't know what the differences are, nor what impact they have on the algorithm's effectiveness.
Re: Bayesian Filtering for Spam
by thraxil (Prior) on Aug 17, 2002 at 16:22 UTC
Re: Bayesian Filtering for Spam
by hsmyers (Canon) on Aug 17, 2002 at 15:36 UTC

    I did much the same as you. But being out of copious spare time™, I'd suggest you have two choices if you want to do the do-it-yourself thing:

    • Learn Lisp.
    • Ask Paul Graham to explain the code.
    Of the two, the second is both more direct and most likely to be the shorter game plan. Besides, if that doesn't work, you can fall back on the first suggestion!

    --hsm

    "Never try to teach a pig to sing...it wastes your time and it annoys the pig."
      From the number of technical articles and reports Paul Graham writes on a regular basis, aside from other teaching work he might be involved in, I guess it's always more prudent to just learn Lisp on your own and then try to decipher his spam Lisp code with the knowledge gained. Imagine if there were literally hundreds of 'fans' seeking this knowledge from the wise? That could overwhelm anyone, even a guy holding a Ph.D. Besides, even if he were to explain the code, he'd do it from the point of view of a Lisp expert and resort to describing the aspects of Lisp involved in the code. Therefore, the best option is simply to read an online Lisp reference on your own.


      _____________________
      # Under Construction
        The interesting twist is that this filter can probably be used to block or flag requests to explain the code. So that shouldn't be too much of a nuisance.

      Lisp is well easy. It's just, like, Polish notation, but with the ability to define functions :)

Re: Bayesian Filtering for Spam
by Anonymous Monk on Aug 18, 2002 at 17:29 UTC
    I also considered giving a weight to spam messages based on individual words in the messages. This works at the moment, because spam often contains words which are unlikely to appear in normal messages -- MLM or BEST KEPT SECRET!
    However, we ended up rejecting this approach for two basic reasons:
    1 - spam is evolving to pass through content filters. We are receiving more spam with less obvious references to the product or scheme. In other words, it contains more words which appear in a normal message;
    2 - false positives are not acceptable!

    This made us consider phrasing as a focus. I believe it is in this area that Bayesian filtering would be more likely to succeed. There is a difference in phrasing between someone who is trying to induce an action and someone who is just trying to talk about something.
    However, it would be a great danger to apply such rules at the ISP level as opposed to at the individual home or company level.
    A normal email message in a company that buys and sells commodities does not contain the same words or wording as a normal message exchanged between teenagers, which is also quite different from the normal messages exchanged by the adult members of a family. In fact, our filters rejected messages because family members were talking about the pharmaceuticals being used by their grandfather.
    Such systems need to be trained to work in localized situations. The question becomes whether the human and financial resources needed to train the system are worth the result.
    In our approach to dealing with spam, we have ended up focussing on the money trail. We start with the assumption that the message is intended to generate a response which results in your hard-earned money migrating into the pocket of someone using dubious methods to build a business. The message must include a way for this to happen.
    We are increasingly focussing on that aspect of email messages to trap spam. This has helped reduce false positives. That part of the system has had no false positives in the past five days, while the traditional content based filters continue to generate false positives for spam.
    Using such an approach could improve the ability of Bayesian filters to evolve with the spam. If they used word weighting as a clue to the presence of spam and then honed in on areas of the message which could help trap future variants, they might be able to evolve their filters. However, I would always be afraid to let the filters evolve without continuous intervention by a human being. Until we can program judgement into spam filters, it will still require people to make the final decision on spam.

      Regarding phrasing, you could use a WPE to convert basic anonymous word tokens into phrase tokens.

      We did this for a recruitment company and it worked well.

      The advantage is that a WPE does not need compilation (a la Yapp) and can be managed via a web interface by a non-techie.

      The disadvantage is that it can be slow. The algorithm we developed for the client was very fast, but Oracle (SQL) centric, so it would not be a good fit for other RDBMSs.

      The client also used statistical methods (as discussed here), but they saw the WPE as a major plus point.

      For our spam, we block incoming IP addresses and obviously faked addresses - old hat these days, but it works very well.

      Jacqui

Re: Bayesian Filtering for Spam
by dws (Chancellor) on Aug 19, 2002 at 20:40 UTC
    After playing with a Perl implementation for a while, I suspect that Graham simplified his implementation for purposes of writing the article. In particular, the examples he shows suggest that he unconditionally converts tokens to lower case. I did this when transliterating his code, but found it failed when faced with short spams that read something like 25 MILLION EMAILS! http://... Both "million" and "emails" weigh in below 0.5. But when I use a scheme that lowercases only mixed-case tokens (giving "million" and "MILLION" separate weights), I get fewer false negatives on spam. The problem is that the weighting tables get a lot larger. Mine, drawn from a sample of about 2,000 spam and 1,500 non-spam emails, is 1.5MB. That's rather large to be scanning when doing one-off email filtering (e.g., via procmail).
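    The case rule described above can be a one-liner; this is my wording of it, not dws's code:

```perl
# Lowercase only mixed-case tokens: all-caps shouting keeps its own weight,
# so "MILLION" and "million" end up in separate table entries.
sub normalize {
    my ($token) = @_;
    return $token eq uc($token) ? $token : lc($token);
}

print normalize("MILLION"), "\n";   # MILLION - kept distinct
print normalize("Million"), "\n";   # million
print normalize("million"), "\n";   # million
```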

      I was thinking more along the lines of tie'ing the hashes to a DBM and doing the parse via cron at some kind of reasonable interval. Here are my thoughts now:

      • I don't want to keep an entire corpus of 'bad' and 'good' emails forever. The system should update the current word counts instead of redoing the counts from scratch every time. This means that 'single word' vs. 'phrasing' should be hammered out during the design phase; if I dump the mails from the system, I can't go back and reparse them :)
      • I'm looking at the client interaction, how a client can/should flag a spam vs. flagging a 'good' message. I can see that there will be times when messages come in that are neither spam nor directed email. What about mis-addressed emails?
      • Should the client be able to set the probability of the filter? Do I put messages above that probability into a separate folder, delete them outright, or add them to the 'bad' email count automatically? What would be the most sane default behaviour?
      • In Graham's article, he talks about fiddling with the 'weight' of the good mail to get better results. What if you could put your thumb on the scale of the 'negativeness' of certain spam? Some messages are simply annoying, some spam is out and out offensive. Maybe I should have the program take that into account.
      • Which parts of this need to be in a module and which parts need to be in the client? Right now I'm thinking that the module should be able to accept the full text of a message (including headers), parse it, and store it in the appropriate DBM (with those optional weights in the previous point). The module should be able to accept the full text of a message and return the parsed word list. The module should be able to take a hashref of words and return a 'score' based on the words and their counts. Everything else should be on the client side.
      • How many messages do I need in the 'good' and 'bad' corpus before I can start relying on the probabilities it is giving me? Up to now, I've just been deleting spam; now I am storing it in a separate folder for use in my parse. This also goes into my client interaction design question, maybe basing the available probabilities on the number of messages in the corpus. (ex. "You can only set the probability filter up to 50% with fewer than 200 spam messages.")
      • I can see a group of super-users who can be relied on to make informed spam decisions. Maybe by classifying the spam. New users wouldn't need to rebuild the spam corpus, but could import the word counts from super-users. ("I want to import the 'porn', 'medical hoaxes', and 'stock tips' word counts from other, trusted users.") Then each user could further refine those counts with their own spam.
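      The tie-a-hash-to-a-DBM idea from the first point might look like this; SDBM_File ships with Perl, and the message text and path here are stand-ins:

```perl
use Fcntl;
use SDBM_File;
use File::Temp qw(tempdir);

# Word counts live in an on-disk DBM, so each new mail updates the counts
# incrementally instead of re-parsing the whole corpus from scratch.
my $db = tempdir(CLEANUP => 1) . "/badwords";   # stand-in path
tie my %bad, 'SDBM_File', $db, O_RDWR | O_CREAT, 0666 or die "tie: $!";

my $spam = "BEST KEPT SECRET! Learn the best kept secret today";  # stand-in
$bad{$_}++ for grep { length } split /\W+/, lc $spam;

print "$bad{secret}\n";   # 2
untie %bad;
```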

      Two big points I can see here are that the system learns without the user saying anything more than "This is spam", and that, because the counts are atomic, they can be shared. I have been reluctant to go with a black list because I think there is the possibility of abuse. Most spam filters require continual updating (which means that you have to be a sysadmin or you have to know what the hell you are doing.) I know that they are effective, I just don't want to have to think about it all the time (as a user or as a sysadmin).

      That's about all I have to say about that for now. If you see some questions that I'm not asking, let me know.

      oakbox

        This means that 'single word' vs 'phrasing' should be hammered out during the design phase, if I dump the mails from the system, I can't go back and reparse them :)
        But you can use the existing filter to accumulate a new corpus before you redefine the word parsing rules, so that shouldn't be such an awfully important concern.
        I'm looking at the client interaction, how a client can/should flag a spam vs. flagging a 'good' message. [ ... ] Do I put messages above that probability into a separate folder, delete them outright, or add them to the 'bad' email count automatically?
        I think this is one thing SpamAssassin has solved perfectly: the spam detector (I don't want to call it a filter) just tags mail by adding some extra headers. The user can then filter on that to their liking using procmail, Mail::Audit or whatever else they may prefer. This approach obsoletes half of your user-interaction questions outright. All decisions about what mail goes where are centralized in .procmailrc or the audit script, and the spam detector has fewer responsibilities and consequently fewer options. That makes the code easier to maintain for developers as well as the configuration easier to maintain for end users, and it stays true to the Unix toolbox philosophy.
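        The delivery side of that split is tiny; a minimal sketch of routing on the X-Spam-Status header that SpamAssassin adds (the mailbox names are my own):

```perl
# Route a message on the X-Spam-Status header that SpamAssassin adds;
# all delivery policy lives here, none in the detector itself.
sub mailbox_for {
    my ($headers) = @_;
    return $headers =~ /^X-Spam-Status:\s*Yes/mi ? "probably-spam" : "inbox";
}

print mailbox_for("From: a\@example.com\nX-Spam-Status: Yes, hits=9.1\n"), "\n";  # probably-spam
print mailbox_for("From: a\@example.com\nSubject: lunch?\n"), "\n";               # inbox
```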

        Makeshifts last the longest.

Approved by broquaint
Front-paged by jeffa