Re^2: Spam filtering and regular expressions

by fokat (Deacon)
on Jul 30, 2005 at 19:30 UTC (#479638)

in reply to Re: Spam filtering and regular expressions
in thread Spam filtering and regular expressions

I agree with jhourcle's words:

(...) distinctions are context sensitive (...)

This is totally true: spammers know this and use it to get around spam filters built this way. One approach we're looking at uses a _capped_ number of replacement sets (i.e., perform just one translation at a time, such as 1 (one) to l (ell)) and evaluates each resulting rewrite against the regular expressions.
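To make the idea concrete, here is a minimal Python sketch of capped replacement. Everything named in it is an assumption for illustration (the REPLACEMENT_SETS table, the single pattern, and the functions candidates and looks_like_spam); it is not the actual filter, only one way the description above could be read.

```python
import re
from itertools import combinations

# Hypothetical replacement sets: (obfuscated character, intended letter).
REPLACEMENT_SETS = [("1", "l"), ("1", "i"), ("0", "o"), ("3", "e"), ("@", "a")]

# Hypothetical spam-flag pattern; a real filter would carry many more.
SPAM_PATTERNS = [re.compile(r"\bfree\s+pills\b", re.IGNORECASE)]

def candidates(text, max_sets=2):
    """Yield the text plus every rewrite that applies at most max_sets replacement sets."""
    seen = set()
    for count in range(max_sets + 1):
        for chosen in combinations(REPLACEMENT_SETS, count):
            sources = [src for src, _ in chosen]
            if len(set(sources)) != len(sources):
                continue  # skip conflicting sets, e.g. 1->l together with 1->i
            rewritten = text.translate(str.maketrans(dict(chosen)))
            if rewritten not in seen:
                seen.add(rewritten)
                yield rewritten

def looks_like_spam(text, max_sets=2):
    """True if any capped rewrite of the text matches any spam pattern."""
    return any(p.search(c) for c in candidates(text, max_sets) for p in SPAM_PATTERNS)
```

With these made-up tables, "fr33 pi11s" is only caught when two sets (3 to e and 1 to l) may be combined; with max_sets=1 neither single rewrite produces "free pills".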

The results we're getting with this are better than with just regular expressions, but not spectacular. There are more knobs to turn (how many replacements to perform and evaluate, how much each match should add to the score, and where to set the threshold, for instance), in addition to the set of regexes used to detect spam-flag phrases.
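The scoring knobs can be pictured with a small Python sketch. The weights, the threshold, and the patterns are invented values, and counting each pattern at most once no matter how many rewrites it matches is an assumption too; other scoring rules are equally plausible.

```python
import re

# Hypothetical weights: how much each matching pattern adds to the score.
WEIGHTED_PATTERNS = [
    (re.compile(r"\bfree\b", re.IGNORECASE), 1.5),
    (re.compile(r"\bpills\b", re.IGNORECASE), 2.0),
    (re.compile(r"\bact\s+now\b", re.IGNORECASE), 1.0),
]
THRESHOLD = 3.0  # hypothetical cut-off

def spam_score(variants):
    """Sum the weight of every pattern that matches at least one variant."""
    return sum(
        weight
        for pattern, weight in WEIGHTED_PATTERNS
        if any(pattern.search(v) for v in variants)
    )

def is_spam(variants, threshold=THRESHOLD):
    """Flag the message when the accumulated score reaches the threshold."""
    return spam_score(variants) >= threshold
```

Here variants stands for the original message plus whatever capped rewrites were generated for it. For example, is_spam(["fr33 pi11s", "free pi11s", "fr33 pills"]) scores 1.5 + 2.0 and crosses the 3.0 threshold.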

A similar approach could be implemented using (hairy, IMHO) regexes. Those regexes are likely much harder to maintain, and I guess they might be more expensive to run than the described approach. However, we have not tested this, because we do not yet have a satisfactory solution to benchmark against.

Oh... and UTF is going to make for a very, very large set of glyphs.

Indeed. This is why you must cap the number of replacements performed when using this method.
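A rough way to see why the cap matters: if up to cap of num_sets replacement sets may be combined, the number of rewrites to evaluate is bounded by the sum of the binomial coefficients C(num_sets, k) for k from 0 to cap. The helper below is a hypothetical back-of-the-envelope calculation in Python, assuming every combination of sets is tried.

```python
from math import comb

def max_candidates(num_sets, cap):
    """Upper bound on rewrites to evaluate: choose 0..cap of num_sets replacement sets."""
    return sum(comb(num_sets, k) for k in range(cap + 1))

print(max_candidates(5, 2))     # 16
print(max_candidates(1000, 2))  # 500501
print(max_candidates(20, 20))   # 1048576, i.e. 2**20 with no effective cap
```

Even with a cap of 2, a thousand glyph-level replacement sets already means about half a million rewrites per message, and lifting the cap makes the count grow exponentially.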

Best regards

-lem, but some call me fokat
