I've been looking at the article too. (N.B. The article is about applying Bayesian techniques to spam filtering, but its code fragments are in LISP.)

The first fragment translates to something more like:

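use List::Util qw(min);  # min() is not a Perl builtin

my %good;  # count of token occurrences in "good" email
my %bad;   # count of token occurrences in "bad" email
my $ngood; # number of "good" messages
my $nbad;  # number of "bad" messages

sub findProb {
    my $word = shift;

    # Default missing counts to 0 before doubling, so an unseen
    # word does not trigger an "uninitialized value" warning.
    my $g = 2 * ($good{$word} || 0);
    my $b =      $bad{$word}  || 0;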

    return undef unless ($g + $b) > 5;

    my $num = min(1.0, $b/$nbad);
    my $denom = min(1.0, $g/$ngood) + min(1.0, $b/$nbad);
    my $prob = $num / $denom;

    return 0.99 if $prob > 0.99;
    return 0.01 if $prob < 0.01;
    return $prob;
}

In the article, Graham explains:

I want to bias the probabilities slightly to avoid false positives, and by trial and error I've found that a good way to do it is to double all the numbers in good. This helps to distinguish between words that occasionally do occur in legitimate email and words that almost never do. I only consider words that occur more than five times in total (actually, because of the doubling, occurring three times in nonspam mail would be enough). And then there is the question of what probability to assign to words that occur in one corpus but not the other. Again by trial and error I chose .01 and .99. There may be room for tuning here, but as the corpus grows such tuning will happen automatically anyway.
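To see what the doubling does in practice, here is a small self-contained sketch. The corpus sizes and token counts are made up for illustration, and word_prob is a hypothetical variant of findProb that takes the "good" weight as a parameter so the two cases can be compared side by side. It is not from the article.

```perl
use strict;
use warnings;
use List::Util qw(min max);

# Hypothetical corpus sizes and token counts, invented for illustration.
my ($ngood, $nbad) = (1000, 1000);
my %good = (meeting => 40, offer => 3);
my %bad  = (meeting => 1,  offer => 30, viagra => 25);

# Same calculation as findProb, but the multiplier applied to the
# "good" count is a parameter (1 = no bias, 2 = Graham's doubling).
sub word_prob {
    my ($word, $weight) = @_;

    my $good_count = $weight * ($good{$word} || 0);
    my $bad_count  = $bad{$word} || 0;

    # Too few occurrences to say anything about this word.
    return undef unless ($good_count + $bad_count) > 5;

    my $bad_freq  = min(1.0, $bad_count  / $nbad);
    my $good_freq = min(1.0, $good_count / $ngood);
    my $prob      = $bad_freq / ($good_freq + $bad_freq);

    # Clamp to [0.01, 0.99] so no single word is treated as certain.
    return max(0.01, min(0.99, $prob));
}

for my $word (qw(meeting offer viagra rare)) {
    my @probs = map { word_prob($word, $_) } (1, 2);
    printf "%-8s weight 1: %-6s weight 2: %s\n", $word,
        map { defined $_ ? sprintf('%.3f', $_) : 'undef' } @probs;
}
```

With these invented counts, "offer" (3 good, 30 bad) drops from about 0.909 to about 0.833 once the good count is doubled, "viagra" (never seen in good mail) is clamped to 0.99 either way, and "rare" comes back undef because it was never seen at all.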

In reply to Re: Re: LISP translation help?? by dws
in thread LISP translation help?? by spaz
