
In fact, it is better not to put them in the "correct" bucket, because as Paul Graham noted, where a spammer may try to subvert rule based filters with "vi.agra" instead of "viagra", the former will get marked as a 100% indicator for spam, where the latter might have been innocent.

The problem with this is, there are too many ways to mangle a word such as "viagra". I've seen fifty or so variations already.

Just a quick regex-based scrape:

VIagra
V.i.a.gra
Vi@gra
V i a g r a
V1AGR@
V1agra
VI.A.G.R.A
VIAGR@
V&#105;agra
Viagra
viagr@
VlAGR@
V.l.A.G.R.A
V_iagra
vi.a.g.r.a
Vi-agra
V I A G R A
V1AGRA
VViagra
V.iagra
Viagr a
V.I.A.G.R.A
Via-gra
Vviagra
Viagara
VlAGRA
Vi@gr@
V-i-@-g-r-a
V.IAGRA
V1@GRA
Viagraa
Via.gra
Viagrra
viagra
VIAGRA
Viagr@
Viagra
V%iagra
V|agr@
V,I,A,G,R,A
V.I,A.G,R.A
V iagra
Viagr*a
Vi^agra
V'1'a'g'r'a
Viagraaaaa
Via.graa
V-i-a-g-r-a
Vi.agra
v-i-a-g''r''a
V'l'a'g'r'a
Viagr.a
vit&agra

(And this missed all the ones that spelled the v as "\/" or the a as "/\", all the ones that used entities to obscure a letter other than i (which gets picked up only because the entity happens to contain a 1), and probably some others.)

This is basic arithmetic: if there are four ways to do v, four ways to do a, eight ways to do i, seven places to add extra character(s), and a large number of different combinations of extra characters that can be added (any combination of punctuation, for example; I've also seen "creme" on the end, and I'm sure there are other possibilities), that makes 4*4*8*7*n = 896*n different ways to spell the word, where n is a large number. Repeat for other popular drugs (vicodin gets spelled even more creatively, for example). Add to this the threshold on how many times a word has to occur before the filter considers it interesting, and the order-prescription-drugs spammers alone will be sending you several *million* messages before your naive Bayesian filters become effective.
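To put rough numbers on that, here is a small sketch. The per-letter counts come from the paragraph above; the values tried for n and the occurrence threshold of 5 are assumptions picked for illustration, and the message estimate further assumes one spelling per message with every spelling used equally often.

```perl
use strict;
use warnings;

# Per-letter counts from the paragraph above:
# ways to write v, ways to write a, ways to write i, insertion points.
my @factors = (4, 4, 8, 7);

# Number of spellings before any filler combinations are counted.
sub base_spellings {
    my $total = 1;
    $total *= $_ for @factors;
    return $total;
}

# $n:         number of distinct filler combinations (a guess, not measured)
# $threshold: sightings needed before the filter trusts a token (assumed)
sub messages_needed {
    my ($n, $threshold) = @_;
    return base_spellings() * $n * $threshold;
}

printf "spellings before fillers: %d\n", base_spellings();
for my $n (10, 100, 1000) {
    printf "n = %4d: %d spellings, about %d messages at threshold 5\n",
        $n, base_spellings() * $n, messages_needed($n, 5);
}
```

With these made-up inputs, n = 1000 already works out to 896,000 spellings and roughly 4.5 million messages for a single drug name.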

This is only true for the serious hardcore mutating spam, the stuff that's always sent from Asia so as to be utterly untraceable, the stuff that gets a whole new subnet every month or so, the stuff that mutates every single aspect of the headers with just about every single message. However, since that stuff is most of the spam I get...

The only thing that's consistent about this stuff is that the IP address from which it's sent never EVER has a PTR record in in-addr.arpa space. If I ran my own mail server, the first thing I would want to implement is a ticket-verification scheme for messages sent from hosts without proper reverse DNS. 99% of the legit mail comes from a host with a proper PTR record, and that mail would be undelayed. The rest would go through one of those one-time verification systems wherein each sender would have to respond once to a verification probe and then would be whitelisted. (Of course, if everyone did this the scumbags would probably arrange to be a domain registrar so that it would cost them little or nothing to burn a domain for each batch of spam...)
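A minimal sketch of that policy, under stated assumptions: `ptr_name`, `disposition`, the `$has_ptr` callback and the `$whitelist` hash are all hypothetical names invented here, not part of any real mail server. The actual DNS lookup is left to the callback so the decision logic can be read (and tested) without a network. IPv4 only.

```perl
use strict;
use warnings;

# Build the in-addr.arpa name that a PTR lookup for an IPv4 address queries,
# e.g. 192.0.2.25 -> 25.2.0.192.in-addr.arpa. Returns undef on bad input.
sub ptr_name {
    my ($ip) = @_;
    my @octets = $ip =~ /\A(\d{1,3})\.(\d{1,3})\.(\d{1,3})\.(\d{1,3})\z/
        or return;
    return if grep { $_ > 255 } @octets;
    return join('.', reverse @octets) . '.in-addr.arpa';
}

# Decide what to do with an incoming message.
#   $has_ptr:   code ref, true if the given name has a PTR record
#               (hypothetical; a real one would query DNS)
#   $whitelist: hash ref of senders who already answered a probe
sub disposition {
    my ($ip, $sender, $has_ptr, $whitelist) = @_;
    my $name = ptr_name($ip);
    return 'deliver' if defined $name && $has_ptr->($name);
    return 'deliver' if $whitelist->{ lc $sender };
    return 'hold_and_probe';   # send the one-time verification probe
}
```

Mail from a host with reverse DNS is never delayed; everything else is held until the sender answers one probe, after which the whitelist lets it straight through.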

See, this is the problem with Paul Graham's approach: the spammers are busy thinking about circumvention, an issue that he ignores completely. If we want to stop spammers from getting through our filters, we're going to have to be more thorough about our approach, in terms of predicting and preventing simple attacks. Naive Bayesian filtering eats flaming death when the spammers switch from plain language to euphemism and throw in some Markov chains (thirty-year-old technology). I predicted this within five minutes after I read Paul Graham's original article on the topic.

Sure enough, when I tried out ifile (seeded with thousands of messages in each category), it was maybe 75% effective, making errors in both directions -- useless. It was admittedly very good at filtering out the simplistic spam, especially things like 419 spam, but it failed miserably on the hard stuff.

A simple technique is not going to solve the matter. The spammers combine techniques. Lots of techniques. We need to combine techniques as well. We need to apply regex technology, so that "monster rod" and "M0n-stur R0>" are the same phrase or at least considered very similar, and then we need to look at not just individual words but phrases, combinations of certain words together in close proximity to one another, and so forth, so that "M0n-stur R0>" scores as a close match to "Turn your rod into a monster." (Yeah, more CPU time. So be it. CPU time is cheaper than my time and cheaper than my bandwidth, too.)

In short, our filters need to be less naive, need to combine various techniques. Can Bayesian analysis help? Sure. Can it do the job by itself? No. Can regular expressions do the job? No. But they can help...
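Here is one rough way the regex part might look: strip the obfuscation with a few substitutions, then compare what is left. The substitution table is a guess based on the variants listed earlier, and the edit-distance step (plain Levenshtein) is an addition on top of the regexes, there to catch near misses like "monstur". All the function names are made up for this sketch, and the one-edit cutoff is arbitrary.

```perl
use strict;
use warnings;

# Reduce an obfuscated token to bare lowercase letters.
sub canonicalize {
    my ($token) = @_;
    my $s = $token;
    $s =~ s/&#(\d+);/chr($1)/ge;   # decode decimal character entities
    $s = lc $s;
    $s =~ s{\\/}{v}g;              # "\/" standing in for v
    $s =~ s{/\\}{a}g;              # "/\" standing in for a
    $s =~ tr/10|@/ioia/;           # common look-alike characters
    $s =~ s/[^a-z]//g;             # drop punctuation and other filler
    $s =~ s/(.)\1+/$1/g;           # collapse repeated letters
    return $s;
}

# Plain Levenshtein distance between two strings.
sub edit_distance {
    my ($s, $t) = @_;
    my @prev = (0 .. length($t));
    for my $i (1 .. length($s)) {
        my @cur = ($i);
        for my $j (1 .. length($t)) {
            my $cost = substr($s, $i - 1, 1) eq substr($t, $j - 1, 1) ? 0 : 1;
            my $best = $prev[$j - 1] + $cost;                      # substitute
            $best = $prev[$j] + 1    if $prev[$j] + 1 < $best;     # delete
            $best = $cur[$j - 1] + 1 if $cur[$j - 1] + 1 < $best;  # insert
            push @cur, $best;
        }
        @prev = @cur;
    }
    return $prev[-1];
}

# True if the two tokens are within $max edits after canonicalizing both.
sub fuzzy_match {
    my ($token, $word, $max) = @_;
    $max = 1 unless defined $max;
    return edit_distance(canonicalize($token), canonicalize($word)) <= $max;
}

# Fraction of tokens in $suspect that fuzzily match some word in $reference,
# in any order.
sub phrase_score {
    my ($suspect, $reference) = @_;
    my @tokens = grep { length canonicalize($_) } split ' ', $suspect;
    return 0 unless @tokens;
    my @words = split ' ', $reference;
    my $hits  = 0;
    for my $token (@tokens) {
        $hits++ if grep { fuzzy_match($token, $_) } @words;
    }
    return $hits / scalar @tokens;
}
```

With this, `phrase_score('M0n-stur R0>', 'Turn your rod into a monster.')` comes out as 1. It is crude: a one-edit cutoff will produce false matches on short words, and spellings that put whitespace between every letter ("V i a g r a") get split into single-letter tokens before they can be rejoined, so they would need a separate pass.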

$;=sub{$/};@;=map{my($a,$b)=($_,$;);$;=sub{$a.$b->()}}
split//,".rekcah lreP rehtona tsuJ";$\=$ ;->();print$/

In reply to Re: (OT) Fighting spam by jonadab
in thread (OT) Fighting spam by Aristotle
