Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

I have a swear filter, but it's too bloody strict for my liking. It blocks out the A-- word (and all other swears) by subbing in **** instead. It's too picky, because it produces cl**** instead of class or ****umption instead of assumption. How can I get it to block only full words rather than any word containing a swear word?

for (grep defined($_), keys %chat) {
    my ($name, $message, $ip) = split /~~/, $chat{$_};
    $message =~ s/$_/****/g for @words;    # swear words are evil!
}

Replies are listed 'Best First'.
Re: regex for swear filter
by kvale (Monsignor) on Feb 13, 2004 at 04:23 UTC
    Note that there is already a CPAN module dealing with your application: Regexp-Common-profanity_us-2.2. It may be easier and less error-prone to use the module, or at least to mine it for good tips.

    -Mark

Re: regex for swear filter
by Vautrin (Hermit) on Feb 13, 2004 at 04:13 UTC

    The \b assertion matches at a word boundary, \W matches a non-word character, and \s matches whitespace (and of course you could match literal spaces). I would use \b unless you want to do something else: it is zero-width, whereas a character matched by \s or \W becomes part of the match and gets substituted out along with the word, i.e.:

    $sometext =~ s[\ba--\b][***]sgi;

    Also, don't forget the i modifier, which makes the match case-insensitive so that A-- doesn't slip under the radar.
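
    Applied to the loop in the question, the same idea might look like the sketch below. The word list, the sample messages, and the censor() helper are placeholders made up for illustration; quotemeta is there only to guard against regex metacharacters turning up in the list.

```perl
use strict;
use warnings;

# Hypothetical word list; stand-ins for the real swear words.
my @words = qw(darn heck);

# Build one alternation, escaping metacharacters, and anchor it
# with \b on both sides so that only whole words match.
my $alternation = join '|', map { quotemeta($_) } @words;
my $swear_re    = qr/\b(?:$alternation)\b/i;

# censor() is an illustrative helper, not part of any module.
sub censor {
    my ($message) = @_;
    $message =~ s/$swear_re/****/g;
    return $message;
}

print censor("Darn, what the heck!"), "\n";   # prints "****, what the ****!"
print censor("I was darning socks."), "\n";   # prints "I was darning socks."
```

    Compiling the alternation once with qr// also avoids re-interpolating every word for every message.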


Re: regex for swear filter
by chimni (Pilgrim) on Feb 13, 2004 at 04:14 UTC

    My interpretation is that you don't want to match "patterns" that occur as part of larger words.
    For this, look at the use of word boundaries.
    I would suggest a look at perlrequick (via perldoc) for such problems; below is what it says about this one.
    The word anchor \b matches a boundary between a word character and a non-word character \w\W or \W\w:

        $x = "Housecat catenates house and cat";
        $x =~ /\bcat/;    # matches cat in 'catenates'
        $x =~ /cat\b/;    # matches cat in 'housecat'
        $x =~ /\bcat\b/;  # matches 'cat' at end of string

    HTH
    chimni

      How are you going to discuss the comparative advantages and disadvantages of various beasts of burden with a filter like that?

        When forced by a prudish management to solve a similar problem, I assigned a point system. Each regex that applied would add or subtract points. Only those matches passing a point threshold would be scrubbed.

        For example, it's more likely to be an intentional curse if it's at the beginning or end of a word. It's more likely still if it is the whole word (word boundaries on both ends). It's less likely if it appears buried in a word; these are not filtered, much to the relief of residents of Scunthorpe.
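
        A minimal sketch of how such a point system might look. The weights, the threshold, the placeholder word, and the score() and scrub() helpers are all assumptions made up for illustration; the rules actually used are not given here.

```perl
use strict;
use warnings;

# Hypothetical placeholder for a real swear word.
my $bad = 'heck';

# Each rule pairs a regex with the points it contributes when it
# matches a candidate word. Weights and threshold are guesses.
my @rules = (
    [ qr/^\Q$bad\E\z/i,  3 ],   # the whole word
    [ qr/^\Q$bad\E./i,   1 ],   # at the beginning of a longer word
    [ qr/.\Q$bad\E\z/i,  1 ],   # at the end of a longer word
    [ qr/.\Q$bad\E./i,  -2 ],   # buried in the middle of a word
);
my $threshold = 1;

# Sum the points of every rule that matches the given word.
sub score {
    my ($word) = @_;
    my $points = 0;
    for my $rule (@rules) {
        my ($re, $value) = @$rule;
        $points += $value if $word =~ $re;
    }
    return $points;
}

# Scrub only those words whose score reaches the threshold.
sub scrub {
    my ($message) = @_;
    $message =~ s/(\w+)/my $w = $1; score($w) >= $threshold ? '****' : $w/ge;
    return $message;
}

print scrub("Heck, checkers"), "\n";   # prints "****, checkers"
```

        Keeping the rules in a table makes it easy to tune the weights or add further signals without touching the scoring loop.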

        --
        [ e d @ h a l l e y . c c ]

Re: regex for swear filter
by Corion (Patriarch) on Feb 13, 2004 at 07:19 UTC

    Just to add to what kvale mentioned, there is also Regexp::Common, which also has a profanity regex built in, but the word set is different as far as I know.

    You didn't discuss why you want to filter "profanity", but take a look at this and consider the possible consequences before liberally applying a profanity filter.

      I have no intention of changing anything in the 'profanity' regexp of Regexp::Common.[1] I only keep it there for backwards compatibility reasons. I think it's far too controversial (someone's profanity is someone else's common words) to be included.

      [1] Well, I might add words like 'gun', 'God', 'children', 'Perl 6', and 'George W. Bush'.

      Abigail