in reply to Robust Anti-Swear script

You could try "laundering" the text on its way in, rather than writing really complicated regexps. In other words, first render it in lowercase, remove all whitespace and punctuation, make substitutions for l337 speak, then look for the nasties. It's fairly easy with tr:

    $message = lc $message;       # lowercase
    $message =~ tr/ .;,//d;       # remove spaces and punctuation
    $message =~ tr/13457/least/;  # un-1337
(Warning, this is untested.) Of course you could combine those into a single statement. I just broke it up for clarity.
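Pulling those lines together, a runnable sketch of the whole pipeline might look like this (the launder() helper and the word list are mine, purely for illustration):

```perl
use strict;
use warnings;

# Sketch of the laundering pipeline above; launder() and the
# word list are illustrative only.
sub launder {
    my $text = lc shift;           # lowercase first
    $text =~ tr/ \t.;,!?'"//d;     # drop whitespace and punctuation
    $text =~ tr/013457/oleast/;    # un-1337: 0=>o 1=>l 3=>e 4=>a 5=>s 7=>t
    return $text;
}

my @naughty = qw(damn ass);        # stand-in word list
my $clean   = launder("I had amnesia.");   # becomes "ihadamnesia"
print "blocked\n" if grep { index($clean, $_) >= 0 } @naughty;
```

Note that this demonstrates the false-positive problem directly: "ihadamnesia" contains "damn", so the innocent sentence gets blocked.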

Do realize that the more munging you do in this fashion, the more prone your code will be to false positives. For example, the sample lines I wrote above would trigger on "I had amnesia" and "45 sweaters".

Replies are listed 'Best First'.
Re (2): Robust Anti-Swear script
by RatArsed (Monk) on Jul 31, 2001 at 12:39 UTC
    There are many issues with censorship that I'll try to avoid here; instead I'd like to extend the ideas already discussed on how to remove profanity.

    I think you'd get a lot of false matches on perfectly harmless language, so you need to guard against these without annoying people (IMHO, you'd annoy more people by punishing the innocent than by not stopping profanity).

    The classic example I always use is my friend Dick from Scunthorpe. He has a pet Ass (as in donkey).

    Now there are (at least) three false triggers in that paragraph (although some things like "c ex" might trigger more), which leads on to dialect: in UK English, an Ass is nothing but a donkey. You sit on your Arse, but getting drunk is being rat-arsed.

    Perhaps an alternative is to match against the bad words first, then check each hit against that bad word's "safe" list (with words like Scunthorpe).
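That two-step check might be sketched like this (the word lists and the is_offensive() helper are invented for the example; Scunthorpe would sit in the safe list of a different bad-word entry):

```perl
use strict;
use warnings;

# Sketch of the bad-word-then-safe-list idea; both lists and
# the is_offensive() helper are invented for illustration.
my %safe_for = (
    ass => [qw(classic assistant bass pass)],
);

sub is_offensive {
    my $text = lc shift;
    for my $bad (keys %safe_for) {
        while ($text =~ /(\w*$bad\w*)/g) {   # whole word around the hit
            my $hit = $1;
            return 1 unless grep { $hit eq $_ } @{ $safe_for{$bad} };
        }
    }
    return 0;
}

print is_offensive("my classic assistant") ? "blocked\n" : "ok\n";
```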

    Ultimately there are going to be mistakes, and the only perfect way to do it would be human; although I'd be interested in developing an AI engine that could be taught...

    --
    RatArsed

Re: Re: Robust Anti-Swear script
by Azhrarn (Friar) on Jul 31, 2001 at 00:45 UTC
    Yeah, that was something like my second idea. If I washed the text, it would be a bit easier to match against a wordlist.
    I don't really need to look for case though as //i gets rid of that. Also, I just thought of cases like multiple letters being substituted for one. Like "Ph" for "f." Would (ph|f) work on that for a word with the letter f in it? OR tr/(ph)/f/ ?
    Hopefully not too many people have 45 sweaters. ;)
      You just need a bigger wordlist. Since you're already going to need a list of bad words, why not put your leet speak words in there. Then you'll have fewer false matches on translated words like 45sweaters.

      Anything that is suspect (isn't found in wordlist but contains numbers) could then be translated and run through a second matching.

      It's all in your approach.
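A rough sketch of that two-pass approach (everything here -- the word list, translate(), the "ph" handling -- is invented for illustration; note that a multi-letter swap like "ph" for "f" needs s///, since tr/// only maps single characters):

```perl
use strict;
use warnings;

# Sketch of the two-pass idea; the word list and translate()
# are invented for illustration.
my %bad = map { $_ => 1 } qw(fool);

sub translate {
    my $w = lc shift;
    $w =~ s/ph/f/g;             # multi-letter swaps need s///; tr/// is per-character
    $w =~ tr/013457/oleast/;    # single-character 1337 swaps
    return $w;
}

for my $word (split ' ', "you ph00l you") {
    my $hit = $bad{$word};                                 # first pass: plain match
    $hit ||= $bad{ translate($word) } if $word =~ /\d/;    # second pass on suspects
    print "suspect: $word\n" if $hit;
}
```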

        That wouldn't stop people typing l i k e  t h i s, or t*h*i*s or ASCII swearwords. Besides, you'll never keep up with all the different ways of using letters to come up with nasty words, or swearing in foreign languages, or highly offensive phrases that don't use swearwords, or any combination of the above or or or...

        Using a word list (and listing all the bad words you can think of) is not the right way to go. It's pointless to try and keep up with the amount of cr*p people can spew out; it'll always overtake you in the end. Aim to filter out most of it, and leave the rest to real people.