in reply to Re: LISP translation help??
in thread LISP translation help??
The first fragment translates to something more like:
    use List::Util qw(min);

    my %good;   # count of token occurrences in "good" email
    my %bad;    # count of token occurrences in "bad" email
    my $ngood;  # number of "good" messages
    my $nbad;   # number of "bad" messages

    sub findProb {
        my $word = shift;
        my $g = 2 * ($good{$word} || 0);   # good counts are doubled
        my $b = $bad{$word} || 0;
        return undef unless ($g + $b) > 5; # too rare to judge

        my $num   = min(1.0, $b / $nbad);
        my $denom = min(1.0, $g / $ngood) + min(1.0, $b / $nbad);
        my $prob  = $num / $denom;

        # clamp the result to the range [0.01, 0.99]
        return 0.99 if $prob > 0.99;
        return 0.01 if $prob < 0.01;
        return $prob;
    }

In the article, Graham explains:
I want to bias the probabilities slightly to avoid false positives, and by trial and error I've found that a good way to do it is to double all the numbers in good. This helps to distinguish between words that occasionally do occur in legitimate email and words that almost never do. I only consider words that occur more than five times in total (actually, because of the doubling, occurring three times in nonspam mail would be enough). And then there is the question of what probability to assign to words that occur in one corpus but not the other. Again by trial and error I chose .01 and .99. There may be room for tuning here, but as the corpus grows such tuning will happen automatically anyway.
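To see how the doubling, the threshold, and the .01/.99 limits interact, here is a small self-contained sketch. The helper name word_prob and the corpus sizes of 100 messages each are made up for illustration; the arithmetic is meant to mirror findProb, with the counts passed in as arguments so the snippet runs on its own.

```perl
use strict;
use warnings;
use List::Util qw(min);

# Hypothetical helper: same arithmetic as findProb, but the counts are
# passed in instead of being read from the %good and %bad hashes.
sub word_prob {
    my ($good_count, $bad_count, $ngood, $nbad) = @_;
    my $good = 2 * $good_count;          # good occurrences count double
    my $bad  = $bad_count;
    return unless ($good + $bad) > 5;    # too rare: no probability at all

    my $bad_ratio  = min(1.0, $bad / $nbad);
    my $good_ratio = min(1.0, $good / $ngood);
    my $prob = $bad_ratio / ($good_ratio + $bad_ratio);

    return 0.99 if $prob > 0.99;         # word seen only in bad mail
    return 0.01 if $prob < 0.01;         # word seen only in good mail
    return $prob;
}

# Made-up corpus sizes, purely for illustration.
my ($ngood, $nbad) = (100, 100);

my @cases = (
    [ 'three good, none bad', 3, 0  ],
    [ 'none good, six bad',   0, 6  ],
    [ 'two good, one bad',    2, 1  ],
    [ 'five good, ten bad',   5, 10 ],
);

for my $case (@cases) {
    my ($label, $gc, $bc) = @$case;
    my $p = word_prob($gc, $bc, $ngood, $nbad);
    printf "%-22s => %s\n", $label, defined $p ? $p : 'ignored';
}
```

With these invented numbers, three occurrences in good mail alone are enough to pass the threshold (3 doubled is 6) and get pinned to 0.01, six occurrences in bad mail alone get pinned to 0.99, and the two-good, one-bad word is ignored because its doubled total is only 5. Five good against ten bad comes out even, since doubling turns the five into ten.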