in reply to Bayesian Filtering for Spam

Mr. Graham goes on to show a few lines of Lisp:
(let ((g (* 2 (or (gethash word good) 0)))
      (b (or (gethash word bad) 0)))
  (unless (< (+ g b) 5)
    (max .01
         (min .99 (float (/ (min 1 (/ b nbad))
                            (+ (min 1 (/ g ngood))
                               (min 1 (/ b nbad)))))))))

and then . . .

(let ((prod (apply #'* probs)))
  (/ prod (+ prod (apply #'* (mapcar #'(lambda (x) (- 1 x))
                                     probs)))))
that do the calculations (indecipherable to me, I grew up in Perl)(...)

I immediately dived into the page to figure out the 'magic formula' that I can start building a spam filter around . . . and basically sat drooling at the screen for 20 minutes trying to grok what was being talked about.

Let me try to help you on your way.

Starting with the weirdest critters: "mapcar" is in Lisp what map() is in Perl. The #' in front of the lambda form quotes it as a function, so it gets passed along instead of being called on the spot. "lambda" defines an anonymous function (here with one parameter, x). And "apply" calls a function with the elements of a list as its arguments; applied to * over a whole list, it does what some perlers may know as reduce(). There's been extensive talk about it in the Perl 6 RFCs, and the library List::Util implements it for people who'd like to use it with current-day perls. Heh: the same library implements min() and max() as well. Good. So let's use that.
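
To make the correspondence concrete, here is a small sketch of the two Lisp idioms side by side with their Perl counterparts. The list @probs and its values are made up purely for illustration.

```perl
use strict;
use warnings;
use List::Util qw(reduce);

# Hypothetical per-word probabilities, for illustration only.
my @probs = (0.5, 0.25, 0.75);

# Lisp: (mapcar #'(lambda (x) (- 1 x)) probs)
# map runs the block once per element, with the element in $_.
my @complements = map { 1 - $_ } @probs;

# Lisp: (apply #'* probs)
# apply hands the whole list to * as arguments; folding the list
# pairwise with reduce yields the same product.
my $product = reduce { $a * $b } @probs;

print "complements: @complements\n";
print "product: $product\n";
```

Inside the reduce block, $a holds the running result and $b the next element, the same special variables that sort uses.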

Here's an attempt at a literal conversion into Perl:

use List::Util qw(min max reduce);

sub score {
    my($word) = @_;
    # uses global %good, %bad, $ngood, $nbad
    my $g = 2 * ($good{$word} || 0);
    my $b = $bad{$word} || 0;
    unless($g + $b < 5) {
        return max(0.01, min(0.99,
            min(1, $b / $nbad) /
            (min(1, $g / $ngood) + min(1, $b / $nbad))));
    }
    # otherwise:
    return undef;
}

sub prob {
    my @probs = @_;
    my $prod = reduce { $a * $b } @probs;
    return $prod / ($prod + reduce { $a * $b } map { 1 - $_ } @probs);
}
There. That wasn't so hard, was it? It's just a starting point, though. You won't filter any spam with it just like that, just yet.
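
As a rough illustration of how the two subs might fit together, the sketch below scores each word of a message and combines the results into one number. Everything concrete in it is invented for the example: the counts in %good and %bad, the corpus sizes $ngood and $nbad, and the sample words. It also skips words that score() has no opinion about, which is my assumption about how the undef case should be handled. Tokenizing mail and counting a real corpus are still left to do.

```perl
use strict;
use warnings;
use List::Util qw(min max reduce);

# Invented counts: how often each word was seen in good and bad mail.
my %good = (meeting => 20, free => 2);
my %bad  = (free => 30, viagra => 10);
my ($ngood, $nbad) = (100, 100);   # hypothetical corpus sizes

# Per-word spam probability, or nothing if the word is too rare to judge.
sub score {
    my ($word) = @_;
    my $g  = 2 * ($good{$word} || 0);   # good counts are doubled
    my $bd = $bad{$word} || 0;
    return if $g + $bd < 5;
    my $pg = min(1, $g / $ngood);
    my $pb = min(1, $bd / $nbad);
    return max(0.01, min(0.99, $pb / ($pg + $pb)));
}

# Combine the per-word probabilities into one overall probability.
sub prob {
    my @probs = @_;
    my $prod = reduce { $a * $b } @probs;
    my $inv  = reduce { $a * $b } map { 1 - $_ } @probs;
    return $prod / ($prod + $inv);
}

my @words = qw(free viagra meeting hello);
# Drop words that score() declined to rate.
my @probs = grep { defined } map { score($_) } @words;
printf "spam probability: %.3f\n", prob(@probs);
```

Note that prob() assumes at least one rated word; with an empty list reduce returns undef, so a real filter would want to check for that first.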

Re: Re: Bayesian Filtering for Spam
by oakbox (Chaplain) on Aug 17, 2002 at 22:37 UTC
    You won't filter any spam with it just like that, just yet.

    But it's a GREAT starting place. Along with the more nuts-and-bolts explanation by elusion, it means I don't feel so lost with that code. Thanks to all of you for the great pointers.

    oakbox