in reply to Spam filtering and regular expressions
You might want to ask around at the spam tools mailing list. I used to read it religiously when I was responsible for maintaining spam filters.
I'm guessing someone's probably already done what you describe. If they haven't, I would probably handle it like Soundex, but instead of grouping letters that sound alike, I'd group glyphs that look alike. (Note that I specifically didn't say to map them to the 'right' value, because the (0Oo) and (1lIi) distinctions are context-sensitive (100K! M3ds @ lO% 0ff!), and the true meaning doesn't really matter unless you're trying to determine whether the text is intentionally obfuscated, as opposed to just containing suspicious keywords.)
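To make the Soundex comparison concrete, here is a rough Python sketch of the idea. Only the (0Oo) and (1lIi) groups come from the post above; the other groups, and the names `GLYPH_GROUPS`, `glyph_key` and `looks_like`, are made up for illustration. Treat it as a starting point, not a tested filter.

```python
# Look-alike groups. (0Oo) and (1lIi) are from the post; the remaining
# groups are illustrative assumptions, not a vetted confusables table.
GLYPH_GROUPS = ["0Oo", "1lIi", "3Ee", "@Aa", "5Ss$"]

# Map every glyph to the first character of its group. That character is
# only an arbitrary class label, not a guess at the "right" reading.
CANON = {ch: group[0] for group in GLYPH_GROUPS for ch in group}


def glyph_key(text: str) -> str:
    """Collapse each character to its look-alike class label.

    Characters outside every group are simply lowercased.
    """
    return "".join(CANON.get(ch, ch.lower()) for ch in text)


def looks_like(word: str, keyword: str) -> bool:
    """True when both strings fall into the same look-alike classes."""
    return glyph_key(word) == glyph_key(keyword)
```

With this, `looks_like("M3ds", "Meds")` and `looks_like("lO%", "10%")` both hold, without the code ever deciding whether a given glyph "really" was a letter or a digit. Keywords would be run through `glyph_key` once up front and then compared against the keyed message text.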
Oh... and UTF is going to make for a very, very large set of glyphs.
Re^2: Spam filtering and regular expressions
by fokat (Deacon) on Jul 30, 2005 at 19:30 UTC