in reply to Re^2: (OT) Fighting spam (naive, but not *that* naive)
in thread (OT) Fighting spam

Result: "M0n-stur" only appears in mails that are spam. "Monster" appears in mail that is probably around 30-80% spam, depending on your specific mail traffic. This means you do not want to map the variation back to "monster". The presence of a variation is almost a dead give-away of spam.
Certainly, you don't want to treat "M0n-stur" as equivalent to "Monster", because it's an obfuscated variation. The original poster's point is that naive Bayesian filtering cannot keep up with all the possible variations of "|V|()|\|STER", whereas a good regex can "see" that they're equivalent.

This is why naive Bayesian filtering works as well as it does for spam so far, despite being naive.
Only if all spammers use the exact same obfuscated variants of spam keywords. Since there are literally millions of variants, and spammers are now actively trying to defeat Bayesian filtering, that seems unlikely.
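
To put a rough number on "millions", here is a back-of-the-envelope count in Python. The figures are assumptions of mine, not measurements: I suppose each letter has about 4 plausible renderings (the letter itself plus a few look-alikes) and each gap between letters can hold one of 3 things (nothing, a dot, or a hyphen). The function name is made up for the illustration.

```python
def count_variants(word, renderings_per_letter=4, separators_per_gap=3):
    """Count spellings of `word` under the assumed substitution budget.

    Both defaults are illustrative guesses, not data from real spam.
    """
    letters = len(word)
    gaps = max(letters - 1, 0)  # positions between adjacent letters
    # Each letter and each gap is chosen independently, so the counts multiply.
    return renderings_per_letter ** letters * separators_per_gap ** gaps


print(count_variants("monster"))  # 4**7 * 3**6 = 11943936
```

Even with these modest guesses, a seven-letter word comes out at roughly twelve million spellings, and a token-counting filter has to see each one in training mail before it can score it.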

A good regex can, first of all, tell you "this is obfuscated text", which is enough to flag the mail as spam without figuring out what the text is supposed to be. Going further and finding the word being obfuscated could help more: it would let a string like "|_33t |-|ax0R" pass as a probable joke, while having no tolerance for "\/|AG.RA" at all.
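
A minimal sketch of that two-step idea, in Python for the sake of having something runnable. Everything named here is hypothetical: the LOOKALIKES table is a tiny made-up sample, BLOCKED holds a single word, and the "looks obfuscated" test is a crude heuristic that will also flag harmless tokens such as "and/or". Take it as an illustration of the approach, not a working filter.

```python
import re

# Assumed look-alike renderings per letter (illustrative, far from complete).
LOOKALIKES = {
    "a": ["a", "@", "4"],
    "g": ["g", "9"],
    "i": ["i", "1", "!", "|"],
    "r": ["r"],
    "v": ["v", r"\/"],
}

# Hypothetical list of words we have no tolerance for.
BLOCKED = ["viagra"]

# Step 1 heuristic: symbols that rarely occur inside ordinary words,
# or digits sandwiched between letters.
SYMBOLS = re.compile(r"[|\\/()@$_]")
SANDWICHED_DIGIT = re.compile(r"[A-Za-z][0-9]+[A-Za-z]")


def looks_obfuscated(token):
    """True if the token mixes letters with look-alike symbols or digits."""
    if not re.search(r"[A-Za-z]", token):
        return False
    return bool(SYMBOLS.search(token) or SANDWICHED_DIGIT.search(token))


def variant_pattern(word):
    """Build a regex matching `word` spelled with look-alikes and separators."""
    parts = []
    for ch in word.lower():
        alternatives = LOOKALIKES.get(ch, [ch])
        parts.append("(?:" + "|".join(re.escape(a) for a in alternatives) + ")")
    # Allow at most one separator character between letters.
    return re.compile(r"[.\-_*]?".join(parts), re.IGNORECASE)


def classify(text):
    """Return 'spam', 'suspicious', or 'clean' for a piece of text."""
    # Step 2: an obfuscated spelling of a blocked word is an immediate flag.
    for word in BLOCKED:
        for match in variant_pattern(word).finditer(text):
            if match.group(0).lower() != word:
                return "spam"
    # Step 1 alone: obfuscated, but not a word we recognise.
    if any(looks_obfuscated(token) for token in text.split()):
        return "suspicious"
    return "clean"
```

With this table, classify(r"Buy \/|AG.RA now") comes back as "spam", while classify("|_33t |-|ax0R") is only "suspicious", which is where a real filter could decide to let the joke through or weigh it against other evidence.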

Your last point about stealth spam seems to be on target. However, it shows that we need many tools in our spam-fighting arsenal. One type of content filtering won't do it, and content filtering alone won't do it either. Hopefully we'll have some new weapons soon.
