in reply to Re: Bayesian Filtering for Spam
in thread Bayesian Filtering for Spam

I was thinking more along the lines of tie'ing the hashes to a DBM and running the parse from cron at some reasonable interval; something like the sketch below. Beyond that, here are my thoughts now:
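Roughly what I have in mind (just a sketch: DB_File is only one choice of DBM, and the queue/ directories and parse_tokens() are made-up placeholders for however the flagged mail gets stored and however we end up splitting words):

    #!/usr/bin/perl
    # update_counts.pl - run from cron; folds newly flagged mail into the counts.
    use strict;
    use warnings;
    use DB_File;

    # The persistent word counts live in two DBM files; tie makes them look
    # like ordinary hashes while every update goes straight to disk.
    tie my %good_count, 'DB_File', 'good_counts.db' or die "tie good: $!";
    tie my %spam_count, 'DB_File', 'spam_counts.db' or die "tie spam: $!";

    # Stand-in for whatever word/phrase splitting rules we settle on.
    sub parse_tokens {
        my ($text) = @_;
        return grep { length $_ > 2 } split /[^A-Za-z0-9'\$]+/, $text;
    }

    # Walk the queue of mail the users have flagged since the last run.
    for my $file (glob 'queue/spam/*') {
        open my $fh, '<', $file or next;
        local $/;                              # slurp the whole message
        $spam_count{$_}++ for parse_tokens(<$fh>);
        unlink $file;
    }
    for my $file (glob 'queue/good/*') {
        open my $fh, '<', $file or next;
        local $/;
        $good_count{$_}++ for parse_tokens(<$fh>);
        unlink $file;
    }

    untie %good_count;
    untie %spam_count;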

Two big points I can see here are that the system learns without the user saying anything more than "This is spam", and that, because the counts are atomic, they can be shared. I have been reluctant to go with a blacklist because I think there is the possibility of abuse. Most spam filters require continual updating (which means that you have to be a sysadmin or you have to know what the hell you are doing). I know they are effective; I just don't want to have to think about it all the time (as a user or as a sysadmin).

That's about all I have to say about that for now. If you see some questions that I'm not asking, let me know.

oakbox

Re^3: Bayesian Filtering for Spam
by Aristotle (Chancellor) on Aug 20, 2002 at 11:49 UTC
    This means that 'single word' vs. 'phrasing' should be hammered out during the design phase; if I dump the mails from the system, I can't go back and reparse them :)
    But you can use the existing filter to accumulate a new corpus before you redefine the word parsing rules, so that shouldn't be such an awfully important concern.
    I'm looking at the client interaction, how a client can/should flag a spam vs. flagging a 'good' message. [ ... ] Do I put messages above that probability into a separate folder, delete them outright, or add them to the 'bad' email count automatically?
    I think this is one thing SpamAssassin has solved perfectly: the spam detector (I don't want to call it a filter) just tags mail by adding some extra headers. The user can then filter that to their liking using procmail, Mail::Audit or whatever else they may prefer. This approach obsoletes half of your user interaction questions outright. All decisions about what mail goes where are centralized in .procmailrc or the audit script, and the spam detector has fewer responsibilities and, consequently, fewer options. That makes the code easier for developers to maintain and the configuration easier for end users to manage, and it stays true to the Unix toolbox philosophy.
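    The delivery end can then be as small as this (just a sketch: the X-Spam-Flag header is the sort of thing SpamAssassin adds, so substitute whatever your detector actually writes, and the spam folder path is of course up to you):

        #!/usr/bin/perl
        # Minimal delivery script: mail is piped in here (e.g. from ~/.forward)
        # *after* the detector has tagged it with an extra header.
        use strict;
        use warnings;
        use Mail::Audit;

        my $mail = Mail::Audit->new;    # reads the message from STDIN

        # Header name and folder path are assumptions; adjust to taste.
        my $flag = $mail->get('X-Spam-Flag') || '';

        if ($flag =~ /yes/i) {
            $mail->accept("$ENV{HOME}/mail/spam");   # file tagged mail away
        }
        else {
            $mail->accept;                           # normal delivery
        }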

    Makeshifts last the longest.