In my oreillynet blog today, I call for people to stop working on content-based spam filtering, since it's an unwinnable arms race. As an example, I got a spam today that led off with:
., ,; .R, @FS fUD jos DN Gw, Fzw OUn hdx DLdknFf: qgOKPugU aYkIda @ygoaQr Dj hN Sam xb tJ. mBT. fSV zek Nw; @Hf dxd Stk ALQ TZFwKw: qR ol HJb EmpiiA@ sb .Vz XWw chY:: Aw, ju iA GFk aHs,c woi FsrQua Gcc pW kA IBy HFd ZVx Gsx SME ziyA riA UNvhcHbgj NZaBdunU TYA NsaQfMzrRB , ,:;U : Ae , ,;w .: lze yrP IegDp.
Since that's filled with obvious non-words, I demonstrated how easy it would be to replace those with words or names.
use strict;
use warnings;

# Bucket every name in the dictionary by its length.
my %words;
open( my $fh, '<', '/usr/share/dict/propernames' ) or die $!;
while (<$fh>) {
    chomp;
    push( @{ $words{ length($_) } }, $_ );
}
close $fh;

# Swap each whitespace-delimited token for a random name of the same length.
while (<DATA>) {
    s/(\S+)/replace($1)/ge;
    print;
}

sub replace {
    # No name of this length: leave the token untouched.
    my $list = $words{ length $_[0] } or return $_[0];
    return $list->[ rand @$list ];
}

__DATA__
., ,; .r, @ln qly tlg nq aq, Brg iaB WiW iqpbduk: ifcciWvj Wypdip @rnoqqS lc st unx mm su. Wyl. eee daa jb; @kS kjt smp WkW 8hytct: ih xd WiZ Zlantc@ tg .vk WrW cyW:: hy, vx bo WnW gtx,i 0rW SnjsaS WbW gw oo kkZ rto WeW fvB 0qZ xbcd ocg tfrotxynk veqWhurb kdy wavkuseax0 , ,:;i : yr , ,;i .: Zjc ugr btfau.
which when run gives:
Ti Po Kaj Tao Wes Art Al Ian Jem Tao Raj Caroline Jeanette Harold Bradley Al Ji Stu Ro Hon Axel Kaj Tim Sam Stu Lee Tad Raj Phiroze Ed Ro Lin Shankar Hy Lex Ric Barry Van No Ji Jim Jerry Ram Sorrel Luc Ji Ji Kaj Van Mah Fay Art Hohn Ami Krzysztof Jennifer Jan Novorolsky , Saul : Ed , Per Ro Rob Bob Amedeo
Of course, you can use any list of words you like: /usr/share/dict/propernames just gives nicer results than /usr/share/dict/words did for this example.
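To make the "any list of words" point concrete, here is a minimal sketch of the same length-bucket trick with the word list passed in rather than hard-coded. The names build_buckets, disguise, and pick are made up for illustration, and the command-line convention (word list path as the first argument, text on STDIN) is an assumption, not part of the script above.

```perl
use strict;
use warnings;

# Group a word list by length: { 2 => ['Al', 'Ed'], 3 => ['Sam'], ... }.
sub build_buckets {
    my @words = @_;
    my %buckets;
    push @{ $buckets{ length $_ } }, $_ for @words;
    return \%buckets;
}

# Replace each whitespace-delimited token with a random word of equal
# length. Tokens whose length has no bucket pass through unchanged.
sub disguise {
    my ( $text, $buckets ) = @_;
    $text =~ s/(\S+)/pick( $1, $buckets )/ge;
    return $text;
}

# Pick a random same-length word, or fall back to the token itself.
sub pick {
    my ( $token, $buckets ) = @_;
    my $list = $buckets->{ length $token } or return $token;
    return $list->[ rand @$list ];
}

# Assumed usage: perl disguise.pl /usr/share/dict/words < spam.txt
if (@ARGV) {
    my $path = shift @ARGV;
    open( my $fh, '<', $path ) or die "Cannot open $path: $!";
    chomp( my @words = <$fh> );
    close $fh;
    my $buckets = build_buckets(@words);
    print disguise( $_, $buckets ) while <STDIN>;
}
```

Keeping the buckets in a hash reference means the same disguise routine works with propernames, words, or any list you build by hand.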

xoxo,
Andy

Re: A handy use for /usr/share/dict/words
by Roy Johnson (Monsignor) on Feb 28, 2005 at 19:10 UTC
    It's not a two-state system, in which if you haven't "won", you have utterly lost. Having content-based filtering may be a 50 or 80 or 90% solution. Considering the amount of spam, that's considerable.

    You seem to think it's preferable to have a zero percent solution, since you don't suggest anything else that people should work on.


    Caution: Contents may have been coded under pressure.
      I'm not suggesting people remove their SpamAssassin installs. Lord knows I don't want to remove mine. What I am saying is that we're into the area of diminishing returns. It's time for the next level of solution, something more than the stopgap that we have.

      I don't know what the next solution is, but I do know that content-based filtering is not it. The people who are putting their time and energy into furthering such projects need to stop and point their energies elsewhere.

      xoxo,
      Andy

        To me, the next solution is to have a WWW::Mechanize script that goes to the links in an email body, spiders around the site looking for forms, and fills in any TextArea it finds with the body of the spam (other blanks can be filled in with the From and Subject information). The idea is to find their "Contact Us" space and give them feedback. And to give them increased traffic without increased sales, which is sort of indirect feedback.

        Caution: Contents may have been coded under pressure.
OT:Re: A handy use for /usr/share/dict/words
by zentara (Cardinal) on Mar 01, 2005 at 12:47 UTC
    Do you know the difference between Vioxx and Viagra? One is a Cox-2-Inhibitor, and the other is a Cox-2-Enhancer. :-)

    I'm not really a human, but I play one on earth. flash japh
      I like the idea of the former!