spaz has asked for the wisdom of the Perl Monks concerning the following question:

So as I was reading this wonderful idea about spam control I ran across some LISP:
(let ((g (* 2 (or (gethash word good) 0)))
      (b (or (gethash word bad) 0)))
  (unless (< (+ g b) 5)
    (max .01
         (min .99 (float (/ (min 1 (/ b nbad))
                            (+ (min 1 (/ g ngood))
                               (min 1 (/ b nbad)))))))))
and then later
(let ((prod (apply #'* probs)))
  (/ prod (+ prod (apply #'* (mapcar #'(lambda (x) (- 1 x))
                                     probs)))))
I'd love to know how this works, but I'm not too good with LISP. Perl, however, makes more sense than English sometimes!

Could somebody help with the translation?

-- Dave

Re: LISP translation help??
by spaz (Pilgrim) on Aug 16, 2002 at 19:52 UTC
    Alright, here's my attempt at the first one:
    sub findProb {
        my( $word ) = @_;
        $g = 2*$good{$word} || 0;
        $b = $bad{$word} || 0;
        if( ($g+$b) < 5 ) {
            return( $g + $b );
        }
        else {
            $num = min( 1, $b/$nbad );
            $denom = min( 1, $g/$ngood ) + min( 1, $b/$nbad );
            return( max( .01, min( .99, $num/$denom ) ) );
        }
    }
    Any comments?

    -- Dave
      I've been looking at the article too. (N.B.: the article is about applying Bayesian techniques to spam filtering, but the code fragments are in LISP.)

      The first fragment translates to something more like:

      use List::Util qw(min);

      my %good;     # count of token occurrences in "good" email
      my %bad;      # count of token occurrences in "bad" email
      my $ngood;    # number of "good" messages
      my $nbad;     # number of "bad" messages

      sub findProb {
          my $word = shift;
          my $g = 2 * ($good{$word} || 0);
          my $b = $bad{$word} || 0;

          # LISP: (unless (< (+ g b) 5) ...) yields nil for rare words
          return undef unless ($g + $b) >= 5;

          my $num   = min(1.0, $b / $nbad);
          my $denom = min(1.0, $g / $ngood) + min(1.0, $b / $nbad);
          my $prob  = $num / $denom;

          return 0.99 if $prob > 0.99;
          return 0.01 if $prob < 0.01;
          return $prob;
      }
      In the article, Graham explains
      I want to bias the probabilities slightly to avoid false positives, and by trial and error I've found that a good way to do it is to double all the numbers in good. This helps to distinguish between words that occasionally do occur in legitimate email and words that almost never do. I only consider words that occur more than five times in total (actually, because of the doubling, occurring three times in nonspam mail would be enough). And then there is the question of what probability to assign to words that occur in one corpus but not the other. Again by trial and error I chose .01 and .99. There may be room for tuning here, but as the corpus grows such tuning will happen automatically anyway.
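To make those numbers concrete, here is a minimal sketch that takes the counts as arguments instead of reading global hashes. The name word_prob, its argument order, and the corpus size of 1000 messages each are my own assumptions for illustration, not anything from the article. It follows the LISP test, which scores a word once the doubled good count plus the bad count reaches 5.

```perl
use strict;
use warnings;
use List::Util qw(min max);

# Hypothetical helper (not from the article): per-word spam probability
# computed from raw counts passed in as arguments.
sub word_prob {
    my ($good_count, $bad_count, $ngood, $nbad) = @_;
    my $g     = 2 * ($good_count || 0);   # good counts are doubled
    my $bad_n = $bad_count || 0;
    return if $g + $bad_n < 5;            # too rare to score, like nil in the LISP

    my $bad_freq  = min(1, $bad_n / $nbad);
    my $good_freq = min(1, $g / $ngood);
    # Clamp so a word seen in only one corpus gets .01 or .99, never 0 or 1.
    return max(0.01, min(0.99, $bad_freq / ($good_freq + $bad_freq)));
}

# Assumed corpus sizes: 1000 good and 1000 bad messages.
my $only_good = word_prob(3, 0, 1000, 1000);    # 3 good hits double to 6, so it is scored
my $only_bad  = word_prob(0, 5, 1000, 1000);
my $too_rare  = word_prob(2, 0, 1000, 1000);    # doubles to 4, below the threshold

printf "only good: %.2f\n", $only_good;         # prints "only good: 0.01"
printf "only bad:  %.2f\n", $only_bad;          # prints "only bad:  0.99"
print  "too rare:  ", (defined $too_rare ? $too_rare : "undef"), "\n";
```

The first call shows the "three times in nonspam mail would be enough" case from the quote: the raw ratio is 0, and the clamp turns it into .01.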
Re: LISP translation help??
by seattlejohn (Deacon) on Aug 16, 2002 at 22:52 UTC
    I believe the second works out like this:
    my $prod = 1;
    $prod *= $_ foreach @probs;
    my $inverse_prod = 1;
    $inverse_prod *= $_ foreach map {1-$_} @probs;
    return $prod / ($prod + $inverse_prod);
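For anyone who wants to run it, here is the same calculation wrapped in a sub, as a rough sketch. The name combine_probs is my own invention, and I'm assuming every probability has already been clamped to the .01 to .99 range, so the denominator can never be zero.

```perl
use strict;
use warnings;

# Hypothetical name (not from the article): combine per-word spam
# probabilities the way the second LISP fragment does.
sub combine_probs {
    my @probs = @_;
    my ($prod, $inverse_prod) = (1, 1);
    for my $p (@probs) {
        $prod         *= $p;        # (apply #'* probs)
        $inverse_prod *= 1 - $p;    # product of (- 1 x) over probs
    }
    return $prod / ($prod + $inverse_prod);
}

printf "%.4f\n", combine_probs(0.99, 0.99, 0.01);   # prints 0.9900
printf "%.4f\n", combine_probs(0.5, 0.5);           # prints 0.5000
```

In the first call one "innocent" word cancels one "spammy" word and the remaining .99 decides the result. Neutral words at .5 leave the result at .5.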