I agree with jhourcle's words:

(...) distinctions are context sensitive (...)

This is totally true - spammers know this fact and do use it to get around spam filters built this way. One approach we're looking at, tries to use a _capped_ number of replacement sets (ie, perform just 1 (one) to l (ell) transation at a time) and evaluate each of them against the regular expressions.

The results we're getting with this are better than with just regular expressions, but not spectacular. There are more knobs to turn (how many replacements to perform and evaluate, what value should every match add to the score and what is the threshold, for instance) in addition to the set of regexes that are used to detect spam-flag phrases.

A similar approach could be implemented using (hairy, IMHO) regexes. Those regexes are likely much harder to maintain and I guess they might be more expensive than the described approach. However, no testing has been done because we do not have a satisfactory solution to benchmark against yet.

Oh... and UTF is going to make for a very, very large set of glpyhs.

Indeed. This is why you must cap the amount of replacements to do when using this method.

Best regards

-lem, but some call me fokat


In reply to Re^2: Spam filtering and regular expressions by fokat
in thread Spam filtering and regular expressions by Mr. Lee

Title:
Use:  <p> text here (a paragraph) </p>
and:  <code> code here </code>
to format your post, it's "PerlMonks-approved HTML":



  • Posts are HTML formatted. Put <p> </p> tags around your paragraphs. Put <code> </code> tags around your code and data!
  • Titles consisting of a single word are discouraged, and in most cases are disallowed outright.
  • Read Where should I post X? if you're not absolutely sure you're posting in the right place.
  • Please read these before you post! —
  • Posts may use any of the Perl Monks Approved HTML tags:
    a, abbr, b, big, blockquote, br, caption, center, col, colgroup, dd, del, details, div, dl, dt, em, font, h1, h2, h3, h4, h5, h6, hr, i, ins, li, ol, p, pre, readmore, small, span, spoiler, strike, strong, sub, summary, sup, table, tbody, td, tfoot, th, thead, tr, tt, u, ul, wbr
  • You may need to use entities for some characters, as follows. (Exception: Within code tags, you can put the characters literally.)
            For:     Use:
    & &amp;
    < &lt;
    > &gt;
    [ &#91;
    ] &#93;
  • Link using PerlMonks shortcuts! What shortcuts can I use for linking?
  • See Writeup Formatting Tips and other pages linked from there for more info.