You might want to ask around at the spam tools mailing list. I used to read it religiously when I was responsible for maintaining spam filters.
I'm guessing someone's probably already done what you describe. If they haven't, I would probably handle it like Soundex, but instead of grouping letters that sound alike, group glyphs that look alike. (Note: I specifically didn't say to try to map them to the 'right' value, because the (0Oo) and (1lIi) distinctions are context sensitive ... (100K! M3ds @ lO% 0ff!), and the true meaning doesn't really matter unless you're trying to determine whether it's intentionally obfuscated, as opposed to just a suspicious keyword.)
Oh... and UTF is going to make for a very, very large set of glyphs.
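For what it's worth, here is a rough sketch of the look-alike grouping idea, written in Python only for illustration. The class table and the names `GLYPH_CLASSES`, `glyph_key` and `looks_like` are all made up for this example; only the (0Oo) and (1lIi) groups come from the post above, and the (3Ee) group is my assumption based on the "M3ds" example. A real table would be much larger, especially once non-ASCII glyphs are included.

```python
# Hypothetical look-alike classes. Only "0Oo" and "1lIi" come from the post;
# "3Ee" is an assumed extra group added for the "M3ds" example.
GLYPH_CLASSES = ["0Oo", "1lIi", "3Ee"]

# Map every glyph to the first member of its class. That member is just an
# arbitrary label for the class, not a guess at the "right" character.
_CANON = {ch: cls[0] for cls in GLYPH_CLASSES for ch in cls}


def glyph_key(text):
    """Return a Soundex-like key in which look-alike glyphs collapse together.

    Characters outside every class are simply lowercased.
    """
    return "".join(_CANON.get(ch, ch.lower()) for ch in text)


def looks_like(word, keyword):
    """True if word and keyword are identical once look-alikes are grouped."""
    return glyph_key(word) == glyph_key(keyword)


if __name__ == "__main__":
    for word, keyword in [("M3ds", "meds"), ("100K", "look"), ("0ff", "off")]:
        print(word, keyword, looks_like(word, keyword))
```

Because both sides are reduced to the same key, nothing here has to decide whether "lO%" was meant as "10%" or "lo%". To tell deliberate obfuscation apart from an ordinary suspicious keyword, one could additionally check whether the raw word differs from the keyword while the keys match.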
In reply to Re: Spam filtering and regular expressions
by jhourcle
in thread Spam filtering and regular expressions
by Mr. Lee