Let me ask once more: how likely do you deem "M0n-stur" to be in legitimate mail?

Zero, of course. So what? A naive Bayesian filter doesn't *know* that until it has already seen the token in some minimum number of spam messages -- by which time the spammer has gone on to spell it some other way.

The presence of a variation is almost a dead give-away of spam.

Yes, but the computer isn't smart enough (*certainly* naive Bayesian filters aren't smart enough) to know which spelling is correct. We can introduce larger and larger dictionaries, but one interesting thing about English is that, regardless of how ginormous you make the dictionary, there will be many perfectly cromulent words that aren't in it. And in any case, how would you program your filter to distinguish innocent misspellings (if, for example, I had written "cromelent" above) from deliberately evasive ones ("monstur", "monstir", "mawnsteur", ...)?

Can the filters be made smart enough to know that "/\/\0N-STAR" is probably a deliberate misspelling? Yes, probably, assuming you don't get much legitimate mail that uses 1337 5P33|< -- and that was my point, or a large part of it. It's not good enough to treat "M()n5terr" as a new word that's never been seen before and let each new variation fill my inbox. Ideally the filter ought to figure out that it's a mangled form of the word "monster", but failing that, the token at least needs to be treated as a member of a class of words that match a known pattern. The latter is easier than the former, because all it requires is character classes. Actually figuring out which word is the unmangled original requires a very large and continuously updated dictionary, among other things. That would be a good thing to work toward, certainly, but the patterns based on the character classes can be built automatically from an existing corpus of mail (given only the character equivalence classes, and allowing for some "characters" to be more than one character long, e.g., "/\/\"), whereas the dictionary would have to be built almost entirely by hand.
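
To make the character-class idea concrete, here is a rough sketch in Python (not Perl, and not code from any existing filter). The equivalence table, the name `pattern_key`, and the choice to lump all vowels into one class are my own assumptions for illustration. It reduces a token to a coarse key, so that mangled variants land in the same bucket and the filter can keep statistics per bucket instead of per spelling.

```python
import re

# Hypothetical equivalence table: mangled glyph -> canonical letter.
# Some "characters" are more than one character long, e.g. "/\/\".
GLYPHS = {
    "/\\/\\": "m",
    "|\\/|": "m",
    "()": "o",
    "|<": "k",
    "0": "o",
    "1": "l",
    "3": "e",
    "5": "s",
    "$": "s",
    "@": "a",
    "7": "t",
}

# Longer glyphs come first in the alternation so they win over shorter ones.
_GLYPH_RE = re.compile(
    "|".join(re.escape(g) for g in sorted(GLYPHS, key=len, reverse=True))
)


def pattern_key(token):
    """Reduce a token to a coarse key shared by its mangled variants."""
    s = token.lower()
    s = _GLYPH_RE.sub(lambda m: GLYPHS[m.group(0)], s)  # undo glyph swaps
    s = re.sub(r"[^a-z]", "", s)          # drop separators such as "-"
    s = re.sub(r"[aeiou]+", "V", s)       # any run of vowels is one class
    s = re.sub(r"([a-z])\1+", r"\1", s)   # collapse doubled consonants
    return s


for word in ["monster", "M0n-stur", "M()n5terr", "/\\/\\0N-STAR"]:
    print(word, "->", pattern_key(word))
```

All four spellings come out as the same key, "mVnstVr". Note that this does not separate innocent misspellings from evasive ones: "cromelent" simply gets the same key as "cromulent" and so inherits whatever statistics that key has. A key this coarse will also merge some unrelated words, so the table and the vowel rule would need tuning against a real corpus.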

Furthermore, it's not good enough to assign some probability to "a variation of 'monster'" and be done. I want "monster", or any variation of monster, to have a much higher spam probability when it occurs in close proximity to "rod", or any variation thereof. This gets much more CPU-intensive, of course, but as noted, CPU time is cheaper than bandwidth and much cheaper than hiring a human to filter my mail.
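
One way the proximity idea could be sketched (again in Python, with a made-up function name and window size): besides scoring each token on its own, emit a feature for every pair of tokens that occur within a few positions of each other, and let the filter learn probabilities for those pairs like any other token. The input can be raw words or, if the filter computes them, variant-class keys.

```python
def proximity_features(keys, window=3):
    """Return the set of unordered pairs found within `window` tokens of each other.

    `keys` is the message as a list of tokens; `window` is a hypothetical
    tuning knob, not a value taken from any real filter.
    """
    feats = set()
    for i, a in enumerate(keys):
        # Look only at the next `window` tokens after position i.
        for b in keys[i + 1 : i + 1 + window]:
            if a != b:
                feats.add(tuple(sorted((a, b))))
    return feats


print(proximity_features(["big", "monster", "rod"], window=2))
```

The cost is visible right in the loop: each message yields roughly `window` times as many features as it has tokens, which is where the extra CPU time goes.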


$;=sub{$/};@;=map{my($a,$b)=($_,$;);$;=sub{$a.$b->()}} split//,".rekcah lreP rehtona tsuJ";$\=$ ;->();print$/

In reply to Re: (OT) Fighting spam (naive, but not *that* naive) by jonadab
in thread (OT) Fighting spam by Aristotle
