apprentice has asked for the wisdom of the Perl Monks concerning the following question:

Dearest Brother and Sister Monks, verily I beseech thee.

Knowest thou of a perlish module which screeneth naughty words/phrases from tender eyes?

If ye lack such knowledge, perhaps thou knowest of a list/dictionary of such filth?

Methinks 'twould be very difficult mineself to know all the possible naughty words and an efficient method of winnowing them, so I am hoping for a divine solution.

Your truest disciple.


"Peace, love, and Perl...well, okay, mostly just Perl!" --me

Apprentice

Don't even bother. (was Re: dirty word filter module?)
by dragonchild (Archbishop) on Mar 31, 2004 at 20:35 UTC
    The entire endeavor is fraught with peril, young disciple. Here's but a small taste of what thou shalt encounter if thou continuest upon thy quest:
    1. Define what a "dirty word" is. The easy ones are, well, easy. But, what about "breast"? You going to block all recipes with "Add two chicken breasts"?
    2. Once you have that list, how are you going to match your candidate text against it? What about linebreaks? Whitespace? Punctuation (especially hyphens and periods)?
    3. Unicode is most definitely not your friend here.
    4. Once you find a "dirty word", what are you going to do with it? What about the surrounding text?
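    To make points 1 and 2 concrete, here is a minimal sketch, assuming a tiny stand-in word list (@blocked) and two helper names (normalize, find_blocked) invented for illustration. It flattens linebreaks and punctuation before matching whole words, and it duly trips over the recipe. It is ASCII-only on purpose, so point 3 stays unsolved.

```perl
use strict;
use warnings;

# Hypothetical word list; "breast" is here only to show the false positive.
my @blocked = qw(breast darn);

# Lowercase the text and collapse punctuation, linebreaks and other
# non-alphanumeric runs into single spaces (ASCII only, by assumption).
sub normalize {
    my ($text) = @_;
    my $t = lc $text;
    $t =~ s/[^a-z0-9]+/ /g;
    $t =~ s/^ | $//g;
    return $t;
}

# Return the blocked words that occur as whole words (a plural "s" is tolerated).
sub find_blocked {
    my ($text) = @_;
    my $t = normalize($text);
    return grep { $t =~ /\b\Q$_\Es?\b/ } @blocked;
}

# The recipe is flagged even though nothing about it is offensive.
print join(",", find_blocked("Add two chicken\nbreasts.")), "\n";   # prints "breast"
```

    Note that the same normalization turns "d-a-r-n" into four separate letters, which slips straight past the whole-word match.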

    Additionally, there was a meditation about a month ago that cited a study demonstrating that people decipher words based on three things - the first letter, the last letter, and the set of letters in the middle. For example, you know these words:

    • fcuking
    • pneis
    • ashosle
    Yet, I can guarantee there is no filter for them.
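    One could try to catch such scrambles by reducing every word to the three things listed above. The following is only a sketch under that assumption; scramble_key, unscramble_hit and the harmless stand-in word list are made up for illustration. It also shows the price: any innocent anagram sharing the same first and last letter gets caught too.

```perl
use strict;
use warnings;

# Key = first letter + sorted middle letters + last letter.
sub scramble_key {
    my ($word) = @_;
    my $w = lc $word;
    return $w if length($w) < 4;    # nothing to scramble in short words
    my @middle = split //, substr($w, 1, length($w) - 2);
    my @sorted = sort @middle;
    return substr($w, 0, 1) . join('', @sorted) . substr($w, -1);
}

# Hypothetical stand-in list; substitute the words you actually care about.
my @blocked = qw(darn clam);
my %blocked_key;
$blocked_key{ scramble_key($_) } = $_ for @blocked;

# Return the blocked word this input collapses to, or undef if none.
sub unscramble_hit {
    my ($word) = @_;
    return $blocked_key{ scramble_key($word) };
}

print unscramble_hit("dran"), "\n";   # prints "darn"
print unscramble_hit("calm"), "\n";   # prints "clam", an innocent word caught
```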

    An even worse problem is that profanity is context, not content, as the following examples demonstrate:

    • Flock you!
    • Fsck you!
    • You piece of pus-dripping goathair!

    Not a single "dirty word", yet their context is more offensive than most dirty words. (Arabs have refined swearing into an artform, you pus-dripping son of a motherless flea-bitten camel! *grins*)

    ------
    We are the carpenters and bricklayers of the Information Age.

    Then there are Damian modules.... *sigh* ... that's not about being less-lazy -- that's about being on some really good drugs -- you know, there is no spoon. - flyingmoose

      It's true that there is no perfect solution, but even a 70% solution may be good enough to "bother". We've put filters on our church's mail to get rid of porn spam sent to the pastor(!). The available filtering system is really crippled, but just plugging in a few of the better-known words has taken care of the bulk of it, without (so far) grabbing anything legitimate.

      Based on that experience, I would say that ".php?" is one of the worst dirty words. :-)


      The PerlMonk tr/// Advocate

      Well said, dragonchild.

      Before I go into making the matter even much worse, I will start by making it just a little more perilous. You said:

      For example, you know these words:

      • fcuking
      • pneis
      • ashosle

      m/f[1ciknu]{5}g/
      m/p[1ein]{3}s/    #2
      m/a[0hlos]{5}e/

      these work very well in some cases, but will do more than they are supposed to do in others.
      Take the obvious #2: so what's wrong with pines?
      But you probably won't know the less obvious all the time.
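      The over-matching is easy to check. This small sketch runs pattern #2, unchanged, against a handful of harmless words chosen for illustration (the list and the variable names are assumptions, not part of any real filter), and then again with word boundaries added.

```perl
use strict;
use warnings;

my $re2 = qr/p[1ein]{3}s/;    # pattern #2 from above, unchanged

# Illustrative innocent words.
my @innocent = qw(pines spines pennies opinions);

# Unanchored, the pattern fires anywhere inside a word.
my @caught = grep { /$re2/ } @innocent;

# Word boundaries trim the damage but cannot rescue a true anagram.
my @caught_anchored = grep { /\b$re2\b/ } @innocent;

print "unanchored: @caught\n";           # prints "unanchored: pines spines"
print "anchored:   @caught_anchored\n";  # prints "anchored:   pines"
```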

      Much more frightening than this: once you have started, your clients will likely want you to add bad words from Spanish, German, Yiddish, Portuguese, Russian, French, Finnish, Swedish, Polish and so on.
      Once you do that, there are no good words remaining. Almost every everyday word means something very rude in a number of languages...
      There's no achieving the goal of your task, I'm afraid.

      Cheerio, Sören

      Behold, the dragonchild stirs and breathes fire, burning away the misty clouds of ignorance from mine eyes.

      I am humbled at the agility of the human parser which so easily recognizes words such as "asohsle".

      Verily, the darkness is great before me, yet there is light in the offing and I see the way to win small battles.

      There are three regions which I must protect:
      1 - Usernames
      2 - Message Board postings
      3 - User writeups

      The first is perhaps the easiest. It will be limited to [\w-]{6,20} (\w already covers the underscore). This reduces the possibilities. Perhaps a three-level process: first substitute out all hyphens and underscores, then m/a+[s5ho0l1]{4,}[e3]+/i, and if that is true, try to match a more detailed list of regexes. Also, in this case, I can be as tough as I like without having to explain why---I can simply say, "The username you chose is in use" for anything iffy.
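      The three-level idea might look roughly like this. It is a sketch only: the coarse and detailed patterns are harmless stand-ins (built around "darn" and "heck"), and username_ok is a name invented here. The coarse pattern is deliberately broad and cheap; the detailed list runs only when it fires.

```perl
use strict;
use warnings;

# Hypothetical patterns; substitute real ones.
my $coarse   = qr/[dh][a4e3rc]{2,}[nk]/i;
my @detailed = ( qr/d[a4]+r+n/i, qr/h[e3]+ck/i );

# Return 1 if the username may be registered, 0 if it should be refused.
sub username_ok {
    my ($name) = @_;

    # Allowed alphabet and length: word characters and hyphens, 6 to 20 long.
    return 0 unless $name =~ /\A[\w-]{6,20}\z/;

    # Level 1: strip the separators people use to break up a word.
    ( my $bare = $name ) =~ tr/-_//d;

    # Level 2: cheap coarse test; most names stop here.
    return 1 unless $bare =~ $coarse;

    # Level 3: the more detailed list decides.
    for my $re (@detailed) {
        return 0 if $bare =~ $re;
    }
    return 1;
}

print username_ok("d_a-r_n_fan"), "\n";   # prints 0: caught after stripping
print username_ok("hacker_one"),  "\n";   # prints 1: coarse hit, detail clears it
```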

      The next two are more complex; however, they are not time sensitive. Once a user posts a message to the board or a writeup, I can mark it viewable only by the user and admins until it can be parsed by a more determined script. If that script determines there is anything troublesome, it can notify an admin to view and approve/disapprove/edit it before making it available or deleting it. Further, I can allow users to flag such items if they are offensive, again to be reviewed by an admin. This process should handle the 'two chicken breasts' for everyone except any fowl who may be on the site (I discriminate against no lifeform). As for the 'pus-dripping son of a motherless flea-bitten camel', I am sure the admins will make the appropriate reprimands once it is brought to their attention. The important thing is to prevent young eyes and their parents from inadvertently seeing anything that can be avoided with reasonable caution.
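      The hold-then-scan flow can be sketched as a small state machine. Everything named here (the states, new_post, scan_post, can_view, flag_post, and the stand-in @suspect list) is an assumption for illustration, with posts kept as plain hashes rather than database rows.

```perl
use strict;
use warnings;

# Hypothetical patterns the "more determined script" looks for.
my @suspect = ( qr/\bdarn\b/i );

# New posts start hidden from everyone but the author and admins.
sub new_post {
    my ($author, $body) = @_;
    return { author => $author, body => $body, state => 'pending' };
}

# Background scan: publish clean posts, queue the rest for an admin.
sub scan_post {
    my ($post) = @_;
    my $hits = grep { $post->{body} =~ $_ } @suspect;
    $post->{state} = $hits ? 'needs_review' : 'public';
    return $post->{state};
}

# Visibility rule: public posts for all, anything else for author and admins.
sub can_view {
    my ($post, $viewer, $is_admin) = @_;
    return 1 if $post->{state} eq 'public';
    return ( $is_admin || $viewer eq $post->{author} ) ? 1 : 0;
}

# A user flag sends even a published post back to the admins.
sub flag_post {
    my ($post) = @_;
    $post->{state} = 'needs_review';
    return;
}
```

      With this arrangement the recipe is published after the scan, and a flag from any reader hides it again until an admin has looked.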

      Again, I give thanks for your gift of enlightenment. I shall be stronger knowing better the foe I face.

      Yea, though I tilt at windmills, I am sworn to the quest. Come Sancho, the contest begins...


      "Peace, love, and Perl...well, okay, mostly just Perl!" --me

      Apprentice
        For usernames ... yes, it's not hard because the list of choices is much reduced, especially if they are case-insensitive. You can actually pre-register all the offensive ones and assert eminent domain over any new ones. For example, "Mohammed", "PapaDocDuvalier", or "Abu_Nidal" are offensive only in certain circles. Unless those circles intersect yours, you should be ok.

        It sounds like your site is moderated to some degree. Remember - the more successful your site becomes, one of two situations applies:

        1. the less it can be moderated due to volume
        2. the more moderators you need to keep up with that volume

        I'm not saying anything as to the decision to be moderated or not. Perlmonks is moderated, and has hundreds of moderators, to boot. But, even PM doesn't have automated moderation. A human has to initiate everything.

        PM deals with offensive postings to the boards by allowing any post, then having automated reaping if the post is not approved and the reputation drops to -5 (or something like that). The posts are viewable, though, for a small amount of time. And, this is with the lack of accountability of the Anonymous Monk.
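        Reduced to a rule, that kind of reaping might be sketched as below. This is not PerlMonks' actual code; the field names and should_reap are invented, and the threshold simply takes the "-5 (or something like that)" above at face value.

```perl
use strict;
use warnings;

# Assumed threshold, taken from the rough figure quoted above.
my $REAP_THRESHOLD = -5;

# A post is reaped only if nobody approved it AND voters sank it far enough.
sub should_reap {
    my ($post) = @_;
    return ( !$post->{approved} && $post->{reputation} <= $REAP_THRESHOLD ) ? 1 : 0;
}

print should_reap( { approved => 0, reputation => -6 } ), "\n";   # prints 1
print should_reap( { approved => 1, reputation => -6 } ), "\n";   # prints 0
```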

        User bios are given more latitude here, but the user has to have proven themselves (through the gaining of XP) for some time before gaining all the features. And, still there is community policing.

        Unless you absolutely cannot have a single posting be viewable to whomever it might offend (for whatever legal reasons), community policing is usually enough.

        ------
        We are the carpenters and bricklayers of the Information Age.

        Then there are Damian modules.... *sigh* ... that's not about being less-lazy -- that's about being on some really good drugs -- you know, there is no spoon. - flyingmoose

Re: dirty word filter module?
by borisz (Canon) on Mar 31, 2004 at 20:30 UTC
      Many thanks to you, Brother borisz. These modules bring much gladness to my heart for they are a beginning.

      Many blessings upon you.



      "Peace, love, and Perl...well, okay, mostly just Perl!" --me

      Apprentice
Re: dirty word filter module?
by inman (Curate) on Mar 31, 2004 at 21:11 UTC
    I don't know of an implementation that is specific to 'dirty' words, but spam filters that rely on the frequency and occurrence patterns of certain words, phrases and patterns are possibly a good model to use. Think of the data / documents that you are working with in the same way that an e-mail filter deals with mail.

    Some documents containing the really nasty words can be blacklisted immediately. Some documents can be set aside for review if they contain some words or derivations of the rude words that might warrant attention. The rest will be let through. The decision making is based on patterns and frequency rather than individual words.
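    A minimal sketch of that three-way split, assuming made-up weights and thresholds (the %weight table, $REVIEW_AT, $REJECT_AT, score_document and triage are all hypothetical, and the words are harmless stand-ins). Because each occurrence adds to the score, frequency counts as well as the choice of word. The // operator needs Perl 5.10 or later.

```perl
use strict;
use warnings;

# Hypothetical weights: mild words score low, a "really nasty" one scores high.
my %weight = ( darn => 1, heck => 1, blast => 5 );

# Hypothetical thresholds for the two non-passing outcomes.
my ( $REVIEW_AT, $REJECT_AT ) = ( 2, 5 );

# Sum the weights of every word occurrence in the text.
sub score_document {
    my ($text) = @_;
    my $score = 0;
    for my $word ( lc($text) =~ /[a-z']+/g ) {
        $score += $weight{$word} // 0;
    }
    return $score;
}

# Map the score onto reject / review / pass.
sub triage {
    my ($text) = @_;
    my $score = score_document($text);
    return 'reject' if $score >= $REJECT_AT;
    return 'review' if $score >= $REVIEW_AT;
    return 'pass';
}

print triage("Heck, that was fun"), "\n";   # prints "pass"
print triage("darn darn heck"),     "\n";   # prints "review"
print triage("blast it"),           "\n";   # prints "reject"
```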

    The problem that you face is that (and I would love to be corrected) I haven't seen an implementation along these lines. You will also have to keep updating the patterns, words and rules to cope with the different situations that people find offensive. A filter tuned for children's content would annoy adults.

    One implementation that I have seen used a search engine and a number of complex stored queries that represented rude words. Documents that ranked highly were removed immediately. This approach could be adopted fairly easily since search engine software is readily available. The trick is to assemble the queries to run against the documents.

Re: dirty word filter module?
by Elijah (Hermit) on Mar 31, 2004 at 21:15 UTC
    looks like you are going to need this little guy.
    die,unless@ARGV;for(@ARGV){if(/.*\n*/){$_="* ";}print;}
    Let's see how many people I can piss off with this :)