Re: Re: LISP translation help??

I've been looking at the article too. (N.B.: the article is about applying Bayesian techniques to spam filtering, but the code fragments are in LISP.)

The first fragment translates to something more like:

use List::Util qw(min);  # min() is not a Perl builtin

my %good;  # count of token occurrences in "good" email
my %bad;   # count of token occurrences in "bad" email
my $ngood; # number of "good" messages
my $nbad;  # number of "bad" messages

sub findProb {
    my $word = shift;

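    # Default a missing count to 0 before doubling; without the parens,
    # "2 * $good{$word} || 0" multiplies undef first and warns.
    my $g = 2 * ($good{$word} || 0);
    my $b =      $bad{$word}  || 0;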

    return undef unless ($g + $b) > 5;

    my $num = min(1.0, $b/$nbad);
    my $denom = min(1.0, $g/$ngood) + min(1.0, $b/$nbad);
    my $prob = $num / $denom;

    return 0.99 if $prob > 0.99;
    return 0.01 if $prob < 0.01;
    return $prob;
}

In the article, Graham explains:

I want to bias the probabilities slightly to avoid false positives, and by trial and error I've found that a good way to do it is to double all the numbers in good. This helps to distinguish between words that occasionally do occur in legitimate email and words that almost never do. I only consider words that occur more than five times in total (actually, because of the doubling, occurring three times in nonspam mail would be enough). And then there is the question of what probability to assign to words that occur in one corpus but not the other. Again by trial and error I chose .01 and .99. There may be room for tuning here, but as the corpus grows such tuning will happen automatically anyway.
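
To see what the doubling and the .01/.99 limits do in practice, here is a rough sketch that uses the same formula as findProb above but exposes the weight on the "good" counts as a parameter. The name find_prob_weighted, the $weight argument, and the toy counts are all made up for illustration; none of them come from the article.

```perl
use strict;
use warnings;
use List::Util qw(min max);

# Hypothetical toy corpus: 100 good and 100 bad messages.
my ($ngood, $nbad) = (100, 100);
my %good = (offer => 3, meeting => 20);
my %bad  = (offer => 6, viagra  => 40);

# Same formula as findProb, except the multiplier applied to the "good"
# count is passed in ($weight is an illustrative addition, not Graham's).
sub find_prob_weighted {
    my ($word, $weight) = @_;

    my $gcount = $weight * ($good{$word} || 0);
    my $bcount =            $bad{$word}  || 0;

    # Too few occurrences to say anything about this word.
    return unless $gcount + $bcount > 5;

    my $good_freq = min(1.0, $gcount / $ngood);
    my $bad_freq  = min(1.0, $bcount / $nbad);

    # Clamp to [0.01, 0.99] so one-sided words never reach 0 or 1.
    return max(0.01, min(0.99, $bad_freq / ($good_freq + $bad_freq)));
}

for my $word (qw(offer meeting viagra)) {
    printf "%-8s weight 1: %.2f  weight 2: %.2f\n",
        $word, find_prob_weighted($word, 1), find_prob_weighted($word, 2);
}
```

With these counts, "offer" drops from about 0.67 to 0.50 once the good counts are doubled, while "meeting" (never seen in bad mail) and "viagra" (never seen in good mail) sit at the 0.01 and 0.99 limits either way. A word that hasn't been seen often enough returns nothing, so the caller has to check for undef.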
