in reply to Re: Bayesian Filtering for Spam
in thread Bayesian Filtering for Spam

I was thinking more along the lines of tie'ing the hashes to a DBM and running the parse from cron at some reasonable interval; something like the sketch below. Beyond that, here are my thoughts now:
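Roughly what I have in mind (just a sketch: DB_File is only one choice of DBM, and the queue/ directories and parse_tokens() are made-up placeholders for however the flagged mail gets stored and however we end up splitting words):

    #!/usr/bin/perl
    # update_counts.pl - run from cron; folds newly flagged mail into the counts.
    use strict;
    use warnings;
    use DB_File;

    # The persistent word counts live in two DBM files; tie makes them look
    # like ordinary hashes while every update goes straight to disk.
    tie my %good_count, 'DB_File', 'good_counts.db' or die "tie good: $!";
    tie my %spam_count, 'DB_File', 'spam_counts.db' or die "tie spam: $!";

    # Stand-in for whatever word/phrase splitting rules we settle on.
    sub parse_tokens {
        my ($text) = @_;
        return grep { length $_ > 2 } split /[^A-Za-z0-9'\$]+/, $text;
    }

    # Walk the queue of mail the users have flagged since the last run.
    for my $file (glob 'queue/spam/*') {
        open my $fh, '<', $file or next;
        local $/;                              # slurp the whole message
        $spam_count{$_}++ for parse_tokens(<$fh>);
        unlink $file;
    }
    for my $file (glob 'queue/good/*') {
        open my $fh, '<', $file or next;
        local $/;
        $good_count{$_}++ for parse_tokens(<$fh>);
        unlink $file;
    }

    untie %good_count;
    untie %spam_count;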

Two big points I can see here are that the system learns without the user saying anything more than "This is spam", and that, because the counts are atomic, they can be shared. I have been reluctant to go with a blacklist because I think there is the possibility of abuse. Most spam filters require continual updating (which means that you have to be a sysadmin or you have to know what the hell you are doing). I know they are effective; I just don't want to have to think about it all the time (as a user or as a sysadmin).

That's about all I have to say about that for now. If you see some questions that I'm not asking, let me know.

oakbox

Re^3: Bayesian Filtering for Spam
by Aristotle (Chancellor) on Aug 20, 2002 at 11:49 UTC
    This means that 'single word' vs. 'phrasing' should be hammered out during the design phase; if I dump the mails from the system, I can't go back and reparse them :)
    But you can use the existing filter to accumulate a new corpus before you redefine the word parsing rules, so that shouldn't be such an awfully important concern.
    I'm looking at the client interaction, how a client can/should flag a spam vs. flagging a 'good' message. [ ... ] Do I put messages above that probability into a separate folder, delete them outright, or add them to the 'bad' email count automatically?
    I think this is one thing SpamAssassin has solved perfectly: the spam detector (I don't want to call it a filter) just tags mail by adding some extra headers. The user can then filter that to their liking using procmail, Mail::Audit or whatever else they may prefer. This approach obsoletes half of your user interaction questions outright. All decisions about what mail goes where are centralized in .procmailrc or the audit script, and the spam detector has fewer responsibilities and, consequently, fewer options. That makes the code easier for developers to maintain and the configuration easier for end users to manage, and it stays true to the Unix toolbox philosophy.
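    The delivery end can then be as small as this (just a sketch: the X-Spam-Flag header is the sort of thing SpamAssassin adds, so substitute whatever your detector actually writes, and the spam folder path is of course up to you):

        #!/usr/bin/perl
        # Minimal delivery script: mail is piped in here (e.g. from ~/.forward)
        # *after* the detector has tagged it with an extra header.
        use strict;
        use warnings;
        use Mail::Audit;

        my $mail = Mail::Audit->new;    # reads the message from STDIN

        # Header name and folder path are assumptions; adjust to taste.
        my $flag = $mail->get('X-Spam-Flag') || '';

        if ($flag =~ /yes/i) {
            $mail->accept("$ENV{HOME}/mail/spam");   # file tagged mail away
        }
        else {
            $mail->accept;                           # normal delivery
        }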

    Makeshifts last the longest.