in reply to (OT) Fighting spam

I'd like to put in a couple of short comments FWIW, the first being sublime optimism, and the second being raving paranoia.

First, the optimism: I have been astounded at the effectiveness of "Naive Bayesian Filtering", both in the Mozilla filter and in some fiddling around I have done myself. I can't believe this method has gone from "amazingly effective" to "dead end" in a few short months. It is called "naive" because it treats each token as statistically independent of the others and relies entirely on the underlying statistics. Suppose one were to add a dictionary (or even a heuristic) to recognize random words and nonsense words, or words with two letters transposed? Add, for instance, one or more weighted tokens to the statistical tables that are a function of this added analysis. It should add, I believe, about as much overhead as a spelling check. I think it is far too soon to give up on this simple, cheap(!), unobtrusive, non-invasive, and so far effective method of self defense. It might also incite senders of email to improve their speling skills.
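
To make that concrete, here is a rough sketch of what I mean, in Perl since that's what we speak around here. The word list and the token name are made up purely for illustration; a real filter has its own tokenizer and tables.

    # Sketch only: emit an extra synthetic token whenever a word fails a
    # simple dictionary lookup, so the statistical tables can learn for
    # themselves how spammy "nonsense word present" is.
    use strict;
    use warnings;

    my %dictionary = map { $_ => 1 } qw(the quick brown fox jumps over);  # toy word list

    sub tokenize_with_heuristics {
        my ($text) = @_;
        my @tokens;
        for my $word (map { lc } $text =~ /(\w[\w.'-]*)/g) {
            push @tokens, $word;
            # the added analysis: one extra token per unrecognized word
            push @tokens, '*NONSENSE*' unless $dictionary{$word};
        }
        return @tokens;
    }

    my @tokens = tokenize_with_heuristics("The quick qu4lity f0x jumps");
    print join(' ', @tokens), "\n";   # words plus a *NONSENSE* marker per odd word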

Now the paranoia: I have watched the Internet progress from a joyful, open, friendly world-wide community in the direction of a wholly-owned means of delivering commercial dumbth, just like television, but with a reverse channel for credit card payments. It is progressing distressingly quickly. Every time some scammer or opportunist pulls another fast one, the Internet gets more difficult, more paranoid, more regulated, more complicated, and more favorable to corporations with big budgets and less favorable to everyone else. The deterioration results less from the bad guys than from the reaction to the bad guys.

Let us be very careful about adding complexity. Each new wrinkle makes it more difficult for the public, the amateurs, the open source contributors to compete with the enormously wealthy folks who want to take the Internet away from us. Worse, the more complex the system, no matter how well-intentioned, the more opportunities there are for the black-hats to exploit. Spammers, fraudsters, and panderers are going to continue to thrive on the 'net just as they do IRL. There will continue to be thousands of hijacked consumer appliances as long as crappy software is cheaper to produce than solid software, and there will always be those who respond to the junk email, because a certain portion of the population is going to continue to be credulous where they should be paranoid, like yours truly. Let us continue to oppose the exploiters, but very, very carefully.

Replies are listed 'Best First'.
Re^2: (OT) Fighting spam
by Aristotle (Chancellor) on Nov 16, 2003 at 21:43 UTC

    I have to think you haven't quite understood how Bayesian filtering works. The stuff you're talking about (random words, transpositions) already makes an impact on your statistics. In fact, it is better not to put them in the "correct" bucket, because as Paul Graham noted, where a spammer may try to subvert rule based filters with "vi.agra" instead of "viagra", the former will get marked as a 100% indicator for spam, whereas the latter might have been innocent. The same goes for random words.
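
    To put numbers on it, here is a back-of-the-envelope sketch of the per-token statistic Graham describes; the corpus counts are invented, and I have left out his refinements (doubling the good counts, minimum occurrence thresholds):

        use strict;
        use warnings;

        # Invented corpus counts, for illustration only.
        my %spam_count = ( 'viagra' => 40, 'vi.agra' => 25 );
        my %ham_count  = ( 'viagra' => 10, 'vi.agra' => 0  );
        my ( $nspam, $nham ) = ( 1000, 1000 );    # messages per corpus

        sub token_spam_probability {
            my ($token) = @_;
            my $b = ( $spam_count{$token} || 0 ) / $nspam;
            my $g = ( $ham_count{$token}  || 0 ) / $nham;
            return 0.5 if $b + $g == 0;           # never seen: neutral
            my $p = $b / ( $b + $g );
            $p = 0.99 if $p > 0.99;               # clamp away from certainty
            $p = 0.01 if $p < 0.01;
            return $p;
        }

        printf "%-8s %.2f\n", $_, token_spam_probability($_) for 'viagra', 'vi.agra';
        # viagra   0.80   (seen in both corpora: ambiguous)
        # vi.agra  0.99   (never seen in ham: near-certain spam marker)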

    As for the added complexity, there is not much complexity to add here at all. That's what's so appealing about it to me. There is no fundamental change in the way mail works with this scheme, as opposed to many others proposed so far. And I have a hard time following the argument that complexity necessarily makes a system easier to exploit. Taint checks make a program more complex, too. Encryption adds complexity, but I'm sure no one uses telnet for remote shells over the internet anymore. Complexity is not evil by itself - that's much too simplistic a world view. Everything should be as simple as possible, but no simpler (to invoke a well-known quotation).

    Makeshifts last the longest.

      In fact, it is better not to put them in the "correct" bucket, because as Paul Graham noted, where a spammer may try to subvert rule based filters with "vi.agra" instead of "viagra", the former will get marked as a 100% indicator for spam, whereas the latter might have been innocent.

      The problem with this is, there are too many ways to mangle a word such as "viagra". I've seen fifty or so variations already.

      This is basic arithmetic: if there are four ways to do v, four ways to do a, eight ways to do i, seven places to add extra character(s), and a large number of different combinations of extra characters that can be added (any combination of punctuation, for example; I've also seen "creme" on the end, and I'm sure there are other possibilities), that makes 4*4*8*7*n different ways to spell the word, where n is a large number. Repeat for other popular drugs (vicodin gets spelled even more creatively, for example). Add to this the threshold on how many times a word has to occur to be interesting, and just the order-prescription-drugs spammers alone will be sending you several *million* messages before your naive Bayesian filters become effective.
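
      Just to make the blow-up visible, here is a toy enumeration. The 4/4/8 counts are the ones above; the substitution characters themselves are stand-ins I made up:

          use strict;
          use warnings;

          # Invented substitution sets, one per letter the spammers mangle.
          my @v = ( 'v', 'V', 'u', '\/' );                       # 4 ways to do v
          my @i = ( 'i', 'I', '1', 'l', '|', '!', ':', ';' );    # 8 ways to do i
          my @a = ( 'a', 'A', '@', '4' );                        # 4 ways to do a

          my @variants;
          for my $v (@v) {
              for my $i (@i) {
                  for my $a (@a) {
                      push @variants, "${v}${i}${a}gra";
                  }
              }
          }
          print scalar(@variants), " base spellings\n";
          # 4*8*4 = 128 spellings before the seven insertion points and the
          # open-ended extra characters multiply that by n.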

      This is only true for the serious hardcore mutating spam, the stuff that's always sent from Asia so as to be utterly untraceable, the stuff that gets a whole new subnet every month or so, the stuff that mutates every single aspect of the headers with just about every single message. However, since that stuff is most of the spam I get...

      The only thing that's consistent about this stuff is that the IP address from which it's sent never EVER has a PTR record in in-addr.arpa space. If I ran my own mail server, the first thing I would want to implement is a ticket-verification scheme for messages sent from hosts without proper reverse DNS. 99% of the legit mail comes from a host with a proper PTR record, and that mail would be undelayed. The rest would go through one of those one-time verification systems wherein each sender would have to respond once to a verification probe and then would be whitelisted. (Of course, if everyone did this the scumbags would probably arrange to be a domain registrar so that it would cost them little or nothing to burn a domain for each batch of spam...)
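
      The check itself is cheap. Here's a sketch using nothing but core Perl; a real MTA hook would do more (e.g. verify that the PTR name resolves back to the same address), this just shows the test:

          use strict;
          use warnings;
          use Socket qw(inet_aton AF_INET);

          # Does the connecting IP have a PTR record in in-addr.arpa space?
          sub has_reverse_dns {
              my ($ip) = @_;
              my $packed = inet_aton($ip) or return 0;
              my $name   = gethostbyaddr( $packed, AF_INET );
              return defined $name && length $name;
          }

          my $ip = '192.0.2.10';    # documentation address, stands in for the peer
          if ( has_reverse_dns($ip) ) {
              print "$ip has a PTR record: deliver without delay\n";
          }
          else {
              print "$ip has no PTR record: hold for one-time sender verification\n";
          }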

      See, this is the problem with Paul Graham's approach: the spammers are busy thinking about circumvention, an issue that he ignores completely. If we want to stop spammers from getting through our filters, we're going to have to be more thorough about our approach, in terms of predicting and preventing simple attacks. Naive Bayesian filtering eats flaming death when the spammers switch from plain language to euphemism and throw in some Markov chains (thirty-year-old technology). I predicted this within five minutes after I read Paul Graham's original article on the topic.

      Sure enough, when I tried out ifile (seeded with thousands of messages in each category), it was maybe 75% effective, making errors in both directions -- useless. It was admittedly very good at filtering out the simplistic spam, especially things like 419 spam, but it failed miserably on the hard stuff. A simple technique is not going to solve the matter. The spammers combine techniques. Lots of techniques. We need to combine techniques as well.

      We need to apply regex technology, so that "moster rod" and "M0n-stur R0>" are the same phrase or at least considered very similar, and then we need to look at not just individual words but phrases, combinations of certain words together in close proximity to one another, and so forth, so that "M0n-stur R0>" scores as a close match to "Turn your rod into a monster." (Yeah, more CPU time. So be it. CPU time is cheaper than my time and cheaper than my bandwidth, too.) In short, our filters need to be less naive, need to combine various techniques. Can Bayesian analysis help? Sure. Can it do the job by itself? No. Can regular expressions do the job? No. But they can help...
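
      For the regex part, something like this toy canonicalizer is what I have in mind. The substitution map is invented and nowhere near complete, but it shows the idea of folding variants into one bucket before any scoring happens:

          use strict;
          use warnings;

          # Collapse common digit/symbol substitutions and strip separators,
          # so obfuscated variants of a word land in the same bucket.
          my %subst = ( '0' => 'o', '1' => 'i', '3' => 'e', '4' => 'a',
                        '5' => 's', '7' => 't', '@' => 'a', '$' => 's' );

          sub canonicalize {
              my ($word) = @_;
              $word = lc $word;
              $word =~ s/([0-9\@\$])/$subst{$1} || $1/ge;  # undo letter substitutions
              $word =~ s/[^a-z]+//g;                       # drop dots, dashes, '>' and friends
              $word =~ s/(.)\1+/$1/g;                      # squeeze runs of repeated letters
              return $word;
          }

          print canonicalize($_), "\n" for 'M0n-stur', 'vi.agra', 'V1AGRA';
          # monstur
          # viagra
          # viagra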


      $;=sub{$/};@;=map{my($a,$b)=($_,$;);$;=sub{$a.$b->()}} split//,".rekcah lreP rehtona tsuJ";$\=$ ;->();print$/

        You may think whatever you want about Paul Graham, but he's not that stupid. I am continually surprised that people don't seem to get how and why Bayesian filtering works so effectively against old-fashioned (more on that in a bit) spam.

        Let me ask once more: how likely do you deem "M0n-stur" to be in legitimate mail? How likely is it in spam? And what is the ratio of these probabilities? Now, how likely is "Monster" to be in legitimate mail? How likely is it in spam? And what is the ratio of these probabilities?

        Result: "M0n-stur" only appears in mails that are spam. "Monster" appears in mail that is probably around 30-80% spam, depending on your specific mail traffic. This means you do not want to map the variation back to "monster". The presence of a variation is almost a dead give-away of spam.

        This is why naive Bayesian filtering works as well as it does for spam so far, despite being naive.
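
        For completeness, this is roughly how the per-token numbers are combined into a verdict for the whole message, in the shape Graham described; the token probabilities here are invented:

            use strict;
            use warnings;
            use List::Util qw(reduce);

            # Invented per-token spam probabilities for a short message.
            my %p = ( 'M0n-stur' => 0.99, 'believe' => 0.45, 'days' => 0.35 );

            sub combined_probability {
                my @probs = @_;
                my $prod     = reduce { $a * $b } @probs;
                my $inv_prod = reduce { $a * $b } map { 1 - $_ } @probs;
                return $prod / ( $prod + $inv_prod );
            }

            printf "score: %.3f\n", combined_probability( values %p );
            # One near-certain token ("M0n-stur") drags the whole message to
            # about 0.98, even though the other tokens look harmless.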

        This extreme effectiveness of Bayesian filters against obfuscated variations of keywords has prompted spammers to move beyond variations. They are now circumphrasing, and not mentioning viagra, monster rods or whatever it is they're advertising at all.

        I am now occasionally getting mail along the lines of

        Subject: I never thought I'd see better days

        I was really in a bind until I found this, and now I can even afford to live carelessly. Believe me, it works.

        There is absolutely nothing in there that any kind of content based filter could pick out, unless it were to actually understand the message.

        This is why content based filtering is a dead end. Most of the things you describe will only fool rule based filters; statistical filters, a family of which Bayes is just one member, will pick them up reliably. But they cannot comprehend the message; hence spam such as what I outlined above, and which tachyon and Andy Lester observed as well.

        Makeshifts last the longest.