You might want to ask around at the spam tools mailing list. I used to read it religiously when I was responsible for maintaining spam filters.
I'm guessing someone's probably already done what you describe. If they haven't, I would probably handle it like soundex, but instead of grouping letters that sound alike, I'd group glyphs that look alike. (Note that I specifically didn't say to try to map them to the 'right' value, because the (0Oo) and (1lIi) distinctions are context sensitive ... (100K! M3ds @ lO% 0ff!), and the true meaning doesn't really matter unless you're trying to determine whether it's intentionally obfuscated, as opposed to just a suspicious keyword.)
Oh... and Unicode is going to make for a very, very large set of glyphs.
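
For what it's worth, here is a rough sketch of the look-alike grouping idea in Python. Only the (0Oo) and (1lIi) groups come from the post above; the other groups, and the names `GLYPH_GROUPS`, `glyph_key` and `looks_like`, are made up for illustration. It doesn't try to recover the 'right' character, it just collapses each glyph to a key for its group, the way soundex collapses similar-sounding letters, and it only covers a handful of ASCII glyphs.

```python
# Hypothetical look-alike groups. Only "0Oo" and "1lIi" are from the post;
# the rest are illustrative assumptions and far from complete.
GLYPH_GROUPS = ["0Oo", "1lIi", "3Ee", "@Aa", "5$Ss"]

# Map every glyph to the first member of its group (an arbitrary key,
# not a guess at the character the sender "really" meant).
GLYPH_KEY = {ch: group[0] for group in GLYPH_GROUPS for ch in group}


def glyph_key(text):
    """Collapse look-alike glyphs to their group key; lowercase the rest."""
    return "".join(GLYPH_KEY.get(ch, ch.lower()) for ch in text)


def looks_like(text, keyword):
    """True if keyword appears in text once both are reduced to glyph keys."""
    return glyph_key(keyword) in glyph_key(text)


print(glyph_key("M3ds"), glyph_key("meds"))
print(looks_like("100K! M3ds @ lO% 0ff!", "meds"))
```

Because both the message and the keyword go through the same reduction, "M3ds" and "meds" end up with the same key without anyone having to decide whether a given "l" was meant as a one or an ell. The cost is more false positives, since ordinary words that happen to share a key will match too.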
In reply to Re: Spam filtering and regular expressions
by jhourcle
in thread Spam filtering and regular expressions
by Mr. Lee