in reply to Re^2: (OT) Fighting spam (naive, but not *that* naive)
in thread (OT) Fighting spam

Let me ask once more: how likely do you deem "M0n-stur" to be in legitimate mail?

Zero, of course. So what? A naive Bayesian filter doesn't *know* that until it has already seen the token in some minimum number of spam messages -- by which time the spammer has gone on to spell it some other way.
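To make that concrete, here is a minimal sketch of a per-token estimate of the kind a naive Bayesian filter keeps. The function name, the count dictionaries, and the 0.4 fallback for unseen tokens are assumptions for illustration, not any particular filter's actual code.

```python
def token_spam_probability(token, spam_counts, ham_counts, n_spam, n_ham, unseen=0.4):
    """Estimate how spammy a token is from the number of messages it appeared in.

    spam_counts / ham_counts map token -> number of messages containing it.
    A token the filter has never seen gets the neutral-ish `unseen` value
    (an assumed default), which is exactly the weakness described above.
    """
    s = spam_counts.get(token, 0)
    h = ham_counts.get(token, 0)
    if s + h == 0:
        return unseen  # never seen before: the filter has no evidence either way
    spam_rate = s / n_spam
    ham_rate = h / n_ham
    return spam_rate / (spam_rate + ham_rate)


spam_counts = {"monster": 40}
ham_counts = {"monster": 1}
print(token_spam_probability("monster", spam_counts, ham_counts, 100, 100))
print(token_spam_probability("m0n-stur", spam_counts, ham_counts, 100, 100))
```

The well-known spelling scores high, while the fresh variation falls back to the default and contributes almost nothing to the message's overall score.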

The presence of a variation is almost a dead give-away of spam.

Yes, but the computer isn't smart enough (*certainly* naive Bayesian filters aren't smart enough) to know which spelling is correct. We can introduce larger and larger dictionaries, but one interesting thing about English is that, regardless of how ginormous you make the dictionary, there will be many perfectly cromulent words that are missing from it. And in any case, how would you program your filter to distinguish innocent misspellings (if, for example, I had written "cromelent" above) from deliberately evasive ones ("monstur", "monstir", "mawnsteur", ...)?

Can the filters be made smart enough to know that "/\/\0N-STAR" is probably a deliberate misspelling? Yes, probably, assuming you don't get much legitimate mail that uses 1337 5P33|< -- and that was my point, or a large part of it. It's not good enough to treat "M()n5terr" as a new word that's never been seen before, filling my inbox with each new variation. Ideally the filter ought to figure out that it's a mangled form of the word "monster", but failing that, the token at least needs to be treated as a member of a class of words that match a known pattern. The latter is easier than the former, because all it requires is character classes. Actually figuring out which word is the unmangled original requires a very large and continuously updated dictionary, among other things. That would certainly be a good thing to work toward, but the patterns based on character classes can be built automatically from an existing corpus of mail (given only the character equivalence classes, and allowing some "characters" to be more than one character long, e.g., "/\/\"), whereas the dictionary would have to be built almost entirely by hand.
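As a rough sketch of the character-class idea, the snippet below folds a mangled token into a class key. The LEET table, the name canonical, and the decisions to drop separators and collapse repeated letters are all assumptions made for illustration; a real table would be larger and tuned against a mail corpus.

```python
import re

# Hypothetical equivalence table: each mangled "character" (possibly several
# characters long) maps to the letter it imitates.
LEET = {
    "/\\/\\": "m",
    "|<": "k",
    "()": "o",
    "0": "o",
    "1": "l",
    "3": "e",
    "5": "s",
    "7": "t",
    "@": "a",
    "$": "s",
}

# Try longer glyphs first so "/\/\" is matched as one unit.
_GLYPH = re.compile("|".join(re.escape(g) for g in sorted(LEET, key=len, reverse=True)))


def canonical(token):
    """Reduce a token to a class key shared by its mangled variants."""
    t = token.lower()
    t = _GLYPH.sub(lambda m: LEET[m.group(0)], t)  # undo glyph substitutions
    t = re.sub(r"[^a-z]", "", t)                   # drop separators such as "-"
    t = re.sub(r"(.)\1+", r"\1", t)                # collapse repeated letters
    return t


print(canonical("M()n5terr"))
print(canonical("/\\/\\0N-STAR"))
```

The filter would then keep its statistics per class key rather than per raw token. Note that this only undoes glyph substitutions: a respelling such as "monstur" still gets its own key unless the classes are loosened further, for example by treating vowels as interchangeable.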

Furthermore, it's not good enough to assign some probability to "a variation of 'monster'" and be done. I want "monster", or any variation of monster, to have a much higher spam probability when it occurs in close proximity to "rod", or any variation thereof. This gets much more CPU-intensive, of course, but as noted, CPU time is cheaper than bandwidth and much cheaper than hiring a human to filter my mail.
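One simple way to approximate the proximity requirement is to feed the filter pair features alongside the single tokens, so that it can learn a separate probability for two classes occurring near each other. This is only a sketch under assumptions: the function name, the "~" separator, and the window size are made up for illustration, and the tokens are assumed to have been reduced to class keys already.

```python
def proximity_features(tokens, window=3):
    """Return single-token features plus pair features for nearby tokens.

    A pair feature is emitted for every two tokens at most `window` positions
    apart. The pair is sorted so that word order does not matter.
    """
    features = list(tokens)
    for i, a in enumerate(tokens):
        for b in tokens[i + 1 : i + 1 + window]:
            features.append("~".join(sorted((a, b))))
    return features


print(proximity_features(["huge", "monster", "rod"], window=2))
```

Each message now produces roughly window times as many features as it has tokens, which is where the extra CPU cost comes from.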


$;=sub{$/};@;=map{my($a,$b)=($_,$;);$;=sub{$a.$b->()}} split//,".rekcah lreP rehtona tsuJ";$\=$ ;->();print$/