in reply to Bayesian Filtering for Spam

I also considered giving a weight to spam messages based on the individual words in them. This works at the moment, because spam often contains words which are unlikely to appear in normal messages, such as MLM or BEST KEPT SECRET!
However, we ended up rejecting this for two basic reasons:
1 - Spam is evolving to pass through content filters. We are receiving more spam with less obvious references to the product or scheme. In other words, it contains more words that also appear in normal messages.
2 - False positives are not acceptable!
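To make the word-weighting idea concrete, here is a minimal sketch of how per-word weights might be derived from two small training sets and combined into a message score. It is not the scheme we tested; the function names, the add-one smoothing, and the 0.5 weight for unseen words are all assumptions chosen for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def word_weights(spam_msgs, ham_msgs):
    """Estimate, for each word, how strongly it points to spam (0..1).

    Counts are per message (a word counts once per message) and use
    add-one smoothing so no weight is ever exactly 0 or 1.
    """
    spam_counts = Counter(w for m in spam_msgs for w in set(tokenize(m)))
    ham_counts = Counter(w for m in ham_msgs for w in set(tokenize(m)))
    weights = {}
    for w in set(spam_counts) | set(ham_counts):
        p_spam = (spam_counts[w] + 1) / (len(spam_msgs) + 2)
        p_ham = (ham_counts[w] + 1) / (len(ham_msgs) + 2)
        weights[w] = p_spam / (p_spam + p_ham)
    return weights


def spam_score(text, weights, unknown=0.5):
    """Combine the word weights of a message in log-odds space."""
    log_odds = 0.0
    for w in set(tokenize(text)):
        p = weights.get(w, unknown)  # unseen words are treated as neutral
        log_odds += math.log(p) - math.log(1 - p)
    return 1 / (1 + math.exp(-log_odds))
```

Both objections show up directly in a scheme like this: a spam message written mostly in ordinary words drifts toward a neutral score, and a legitimate message that happens to use heavily weighted words is pushed toward rejection.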

This made us consider phrasing as a focus. I believe this is the area where Bayesian filtering would be more likely to succeed. Someone who is trying to induce an action phrases things differently from someone who is just trying to talk about something.
However, it would be a great danger to apply such rules at the ISP level as opposed to at the individual home or company level.
A normal email message in a company that buys and sells commodities does not contain the same words or wording as a normal message exchanged between teenagers, which in turn is quite different from the normal messages exchanged by the adult members of a family. In fact, our filters rejected messages because family members were discussing the pharmaceuticals being used by their grandfather.
Such systems need to be trained to work in localized situations. The question becomes whether the result is worth the human and financial resources needed to train the system.
In our approach to dealing with spam we have ended up focussing on the money trail. We start with the assumption that the message is intended to generate a response which results in your hard-earned money migrating into the pocket of someone using dubious methods to build a business. The message must include a way for this to happen.
We are increasingly focussing on that aspect of email messages to trap spam. This has helped reduce false positives. That part of the system has had no false positives in the past five days, while the traditional content-based filters continue to generate them.
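As a rough illustration of the money-trail idea, the sketch below pulls the response channels (web links and reply addresses) out of a message body and checks their hosts against a list of domains already tied to spam. This is only one way such a check could look, not a description of our system; the regular expressions, the function names, and the `blocked_domains` list are hypothetical, and channels such as phone numbers or postal addresses are left out.

```python
import re
from urllib.parse import urlsplit

# Deliberately simple patterns; real messages need more careful parsing.
URL_RE = re.compile(r"https?://[^\s<>\"']+", re.IGNORECASE)
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


def response_hosts(body):
    """Return the set of hosts a reader is asked to visit or write to."""
    hosts = set()
    for url in URL_RE.findall(body):
        try:
            # Strip sentence punctuation that the pattern may have swallowed.
            host = urlsplit(url.rstrip(".,;:!?)")).hostname
        except ValueError:
            continue  # malformed URL, ignore it
        if host:
            hosts.add(host.rstrip("."))
    for address in EMAIL_RE.findall(body):
        hosts.add(address.rsplit("@", 1)[1].lower())
    return hosts


def known_trail_hits(body, blocked_domains):
    """List the hosts in the message that fall under a blocked domain."""
    blocked = {d.lower() for d in blocked_domains}
    return sorted(
        h for h in response_hosts(body)
        if any(h == d or h.endswith("." + d) for d in blocked)
    )
```

A family message about grandfather's prescriptions contains no such channel, so a check of this kind has nothing to match, which is the intuition behind the lower false-positive rate.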
Such an approach could improve the ability of Bayesian filters to evolve with the spam. If a filter used word weighting as a clue to the presence of spam and then homed in on the areas of the message that could help trap future variants, it might be able to evolve its own rules. However, I would always be afraid to let the filters evolve without continuous intervention by a human being. Until we can program judgement into spam filters, it will still take people to make the final decision on spam.

Re: Re: Bayesian Filtering for Spam
by Jacqui (Novice) on Aug 19, 2002 at 16:28 UTC

    Regarding phrasing, you could use a WPE to convert basic anonymous word tokens into phrase tokens.

    We did this for a recruitment company and it worked well.

    The advantage is that a WPE does not need compilation a la Yapp and can be managed via a web interface by a non-techie.

    The disadvantage is that it can be slow. The algorithm we developed for the client was very fast but Oracle (SQL) centric, so it would not be a good fit for other RDBMSs.

    The client also used statistical methods (as discussed here) but they saw a WPE as a major plus point.
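For readers unfamiliar with the idea, converting word tokens into phrase tokens can be sketched roughly as below. This is a generic greedy longest-match over a phrase table, not the client's algorithm or any particular WPE; the `phrases` table, the token names, and `max_len` are made up for illustration.

```python
def phrase_tokens(words, phrases, max_len=4):
    """Replace known word sequences with a single phrase token.

    `phrases` maps a tuple of words to the token that should stand in
    for it. At each position the longest matching sequence wins; words
    that start no known phrase pass through unchanged.
    """
    out = []
    i = 0
    while i < len(words):
        # Try the longest candidate first, down to two-word phrases.
        for n in range(min(max_len, len(words) - i), 1, -1):
            candidate = tuple(words[i:i + n])
            if candidate in phrases:
                out.append(phrases[candidate])
                i += n
                break
        else:
            out.append(words[i])
            i += 1
    return out


# Example phrase table; in practice this is the part a non-programmer
# could maintain as plain data.
phrases = {
    ("best", "kept", "secret"): "PHRASE_BEST_KEPT_SECRET",
    ("act", "now"): "PHRASE_ACT_NOW",
}
tokens = phrase_tokens("this is the best kept secret act now".split(), phrases)
```

Because the phrase table is plain data rather than a compiled grammar, it can be edited without regenerating a parser.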

    For our own spam we block incoming IP addresses and obviously faked addresses. Old hat these days, but it works very well.

    Jacqui