regex for swear filter

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Re: regex for swear filter by kvale (Monsignor) on Feb 13, 2004 at 04:23 UTC
Note that there is already a CPAN module dealing with your application: Regexp-Common-profanity_us-2.2. It may be easier and less error-prone to use the module, or at least to mine it for good tips. -Mark
Re: regex for swear filter by Vautrin (Hermit) on Feb 13, 2004 at 04:13 UTC
The \b assertion matches on word boundaries, \W matches a non-word character, and \s matches whitespace (and of course you could match literal spaces). I would use \b unless you want to do something else, because \s and \W consume a character that would then be substituted out along with the word, i.e.: `$sometext =~ s[\ba--\b][***]sgi;` Also don't forget the i switch, which makes the match case-insensitive, so A-- doesn't go under the radar.
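As a rough sketch of the \b-plus-i approach described above, the snippet below builds one case-insensitive, whole-word pattern from a list of words. The word list (`darn`, `heck`) and the `scrub` helper are hypothetical placeholders, not anything from the original question, and replacing each match with asterisks of the same length is just one possible choice.

```perl
use strict;
use warnings;

# Hypothetical word list; mild placeholders stand in for the real thing.
my @banned = qw(darn heck);

# Build one alternation, quoting each word so metacharacters stay literal.
my $alternation = join '|', map { quotemeta } @banned;

# \b on both sides keeps the match to whole words; /i ignores case.
my $banned_re = qr/\b(?:$alternation)\b/i;

# Replace each whole-word match with asterisks of the same length.
sub scrub {
    my ($text) = @_;
    $text =~ s/($banned_re)/'*' x length($1)/ge;
    return $text;
}

print scrub("Darn it, what the heck? Heckle away, darning needle."), "\n";
```

Because of the word boundaries, "Heckle" and "darning" pass through untouched while "Darn" and "heck" are masked.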
Re: regex for swear filter by chimni (Pilgrim) on Feb 13, 2004 at 04:14 UTC
My interpretation is that you don't want to match "patterns" that occur as part of larger words. For this, look at the use of word boundaries. I would suggest a look at perlrequick (perldoc perlrequick) for such problems; below is what it says about this one.
`The word anchor \b matches a boundary between a word character and a non-word character \w\W or \W\w:
$x = "Housecat catenates house and cat";
$x =~ /\bcat/;    # matches cat in 'catenates'
$x =~ /cat\b/;    # matches cat in 'housecat'
$x =~ /\bcat\b/;  # matches 'cat' at end of string`
HTH chimni
Re: Re: regex for swear filter by Anonymous Monk on Feb 13, 2004 at 04:29 UTC
How are you going to discuss the comparative advantages and disadvantages of various beasts of burden with a filter like that?
Re: Re: Re: regex for swear filter by halley (Prior) on Feb 13, 2004 at 12:46 UTC
When forced by a prudish management to solve a similar problem, I assigned a point system. Each regex that applied would add or subtract points. Only those matches passing a point threshold would be scrubbed. For example, it's more likely to be an intentional curse if it's at the beginning or ending of a word. It's more likely to be an intentional curse if it is the whole word (word boundaries on both ends). It's less likely if it appears buried in a word; these are not filtered, much to the relief of residents of Scunthorpe. -- `[ e d @ h a l l e y . c c ]`
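The point system described above might look something like the following minimal sketch. Everything specific here is an assumption made for illustration: the placeholder term `heck`, the weights, the threshold of 1, and the helper names `score_word`, `scrub_word`, and `scrub` are all hypothetical and are not halley's actual rules.

```perl
use strict;
use warnings;

my $term      = 'heck';   # hypothetical placeholder term
my $threshold = 1;        # assumed cut-off, not the value used in the original system

# Each rule is a pattern plus the points it adds or subtracts (weights are guesses).
my @rules = (
    [ qr/^\Q$term\E/i,    +1 ],   # term begins the word
    [ qr/\Q$term\E$/i,    +1 ],   # term ends the word
    [ qr/\w\Q$term\E\w/i, -2 ],   # term is buried inside a longer word
);

# Sum the points of every rule that applies to this word.
sub score_word {
    my ($word) = @_;
    my $score = 0;
    for my $rule (@rules) {
        my ($re, $points) = @$rule;
        $score += $points if $word =~ $re;
    }
    return $score;
}

# Mask the word only if its score reaches the threshold.
sub scrub_word {
    my ($word) = @_;
    return score_word($word) >= $threshold ? '*' x length($word) : $word;
}

sub scrub {
    my ($text) = @_;
    $text =~ s/(\w+)/scrub_word($1)/ge;
    return $text;
}

print scrub("heck heckle checker"), "\n";
```

With these made-up weights, a whole-word match scores 2, a match at only one end of a word scores 1, and a buried match scores -2, so "checker" is left alone. Raising the threshold to 2 would scrub only whole-word matches.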
Re: regex for swear filter by Abigail-II (Bishop) on Feb 13, 2004 at 13:19 UTC
Re: Re: regex for swear filter by halley (Prior) on Feb 13, 2004 at 14:16 UTC
Re: regex for swear filter by Corion (Patriarch) on Feb 13, 2004 at 07:19 UTC
Just to add to what kvale mentioned, there is also Regexp::Common, which has a profanity regex built in, although as far as I know its word set is different. You didn't discuss why you want to filter "profanity", but take a look at this and consider the possible consequences before liberally applying a profanity filter.
Re: regex for swear filter by Abigail-II (Bishop) on Feb 13, 2004 at 10:24 UTC
I have no intention of changing anything in the 'profanity' regexp of Regexp::Common.^[1] I only keep it there for backwards compatibility reasons. I think it's far too controversial (someone's profanity is someone else's common words) to be included. ^[1] Well, I might add words like 'gun', 'God', 'children', 'Perl 6', and 'George W. Bush'. Abigail