Result: "M0n-stur" only appears in mails that are spam. "Monster" appears in mail that is probably around 30-80% spam, depending on your specific mail traffic. This means you do not want to map the variation back to "monster". The presence of a variation is almost a dead give-away of spam.

Certainly, you don't want to treat "M0n-stur" as equivalent to "Monster", precisely because it is an obfuscated variation. The original poster's point is that naive Bayesian filtering cannot keep up with all the possible variations of "|V|()|\|STER", whereas a good regex can "see" that they are equivalent.

This is why naive Bayesian filtering works as well as it does for spam so far, despite being naive.

Only if all spammers use the exact same obfuscated variants of spam keywords. Since there are literally millions of variants, and spammers are now actively trying to defeat Bayesian filtering, that seems unlikely.
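To put a rough number on "millions of variants", here is a back-of-the-envelope count in Python. The substitution table and separator list are made up for illustration, not taken from any real spam corpus; the point is only that a handful of choices per letter multiplies out very quickly.

```python
from math import prod

# Hypothetical table: spellings a spammer might use for each letter.
SUBSTITUTES = {
    "v": ["v", "V", "\\/"],
    "i": ["i", "I", "1", "!", "|"],
    "a": ["a", "A", "@", "4"],
    "g": ["g", "G", "9", "6"],
    "r": ["r", "R"],
}

# Hypothetical separators that may sit between two adjacent letters
# (the empty string means "no separator").
SEPARATORS = ["", ".", "-", "_", " "]


def count_variants(word):
    """Count the spellings reachable with the tables above."""
    # One independent choice per letter; unknown letters have one spelling.
    letter_choices = prod(len(SUBSTITUTES.get(ch, [ch])) for ch in word.lower())
    # One independent separator choice per gap between letters.
    gap_choices = len(SEPARATORS) ** max(len(word) - 1, 0)
    return letter_choices * gap_choices


print(count_variants("viagra"))
```

Even with this small table a six-letter keyword gets into the millions, and a token-based filter has to see each of those variants in training before it can score it.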

A good regex can, first of all, tell you "this is obfuscated text", which is enough to flag the mail as spam without figuring out what the text is supposed to be. Going further and finding the word that is being obfuscated could help more: it would let a string like "|_33t |-|ax0R" pass as a probable joke, while having no tolerance at all for "\/|AG.RA".
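A minimal sketch of that two-stage idea, in Python: first decide whether a token looks obfuscated, then try to recover the underlying word and compare it against word lists. Everything here is an assumption for illustration: the `LEET` table, the `BLOCKED` and `TOLERATED` sets, and the function names are hypothetical, and the detection rule is deliberately crude (it would also flag things like "MP3").

```python
import re

# Hypothetical substitution table; multi-character patterns come first
# so that "|-|" is not consumed piecemeal by the single "|" rule.
LEET = [
    ("|-|", "h"),
    ("|_", "l"),
    ("\\/", "v"),
    ("|", "i"),
    ("3", "e"),
    ("0", "o"),
    ("@", "a"),
    ("1", "i"),
]

BLOCKED = {"viagra"}           # hypothetical zero-tolerance words
TOLERATED = {"leet", "haxor"}  # hypothetical "probably a joke" words


def looks_obfuscated(token):
    """Stage 1: letters mixed with symbols or digits inside one token."""
    core = token.strip(".,;:!?\"'()")
    has_letter = re.search(r"[A-Za-z]", core) is not None
    has_other = re.search(r"[^A-Za-z'\-]", core) is not None
    return has_letter and has_other


def deobfuscate(token):
    """Stage 2: map look-alike glyphs back to letters, drop separators."""
    text = token
    for pattern, letter in LEET:
        text = text.replace(pattern, letter)
    return re.sub(r"[^A-Za-z]", "", text).lower()


def classify(token):
    if not looks_obfuscated(token):
        return "plain"
    word = deobfuscate(token)
    if word in BLOCKED:
        return "spam"
    if word in TOLERATED:
        return "tolerated"
    # Obfuscated, but we cannot tell what it was meant to be.
    return "suspicious"


for token in ["Monster", "M0n-stur", "|_33t", "|-|ax0R", "\\/|AG.RA"]:
    print(token, classify(token))
```

Note that stage 1 alone already separates "M0n-stur" from "Monster"; stage 2 only matters when you want to treat different obfuscated words differently.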

Your last point about stealth spam seems to be on target. However, it points out how we need many tools in our spam-fighting arsenal. One type of content filtering won't do it; and content filtering alone won't do it either. Hopefully we'll have some new weapons soon.

In reply to Re: Re^2: (OT) Fighting spam (naive, but not *that* naive) by forrest
in thread (OT) Fighting spam by Aristotle
