Re: dirty word filter module?

I don't know of an implementation that is specific to 'dirty' words, but spam filters that rely on the frequency and occurrence of certain words, phrases and patterns are possibly a good model to use. Think of the data / documents that you are working with in the same way that an e-mail filter deals with mail.

Documents containing the really nasty words can be blacklisted immediately. Documents can be set aside for review if they contain words, or derivations of the rude words, that might warrant attention. The rest are let through. The decision is based on patterns and frequency rather than on individual words.
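As a rough sketch of that three-way split (block, review, pass), something like the following Python might do as a starting point. The word lists, weights and threshold are placeholders I made up for illustration, not a real lexicon, and the function name triage is hypothetical.

```python
import re
from collections import Counter

# Hypothetical placeholder lists; a real filter would load these from data.
BLOCK_WORDS = {"vileword"}                       # any hit blocks the document
SUSPECT_WORDS = {"rudeword": 2, "crudeword": 1}  # word -> assumed weight
REVIEW_THRESHOLD = 2                             # assumed cut-off for review


def triage(text):
    """Return 'block', 'review' or 'pass' for a piece of text."""
    # Count lower-cased word occurrences in the document.
    words = Counter(re.findall(r"[a-z']+", text.lower()))

    # The really nasty words block the document outright.
    if any(word in words for word in BLOCK_WORDS):
        return "block"

    # Otherwise weigh the suspect words by how often they occur.
    score = sum(weight * words[word] for word, weight in SUSPECT_WORDS.items())
    return "review" if score >= REVIEW_THRESHOLD else "pass"
```

A single mild word passes, but the same word repeated pushes the document into the review pile, which is the frequency-based behaviour described above.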

The problem that you face is that (and I would love to be corrected) I haven't seen an implementation along these lines, so you may have to build it yourself. You will also have to keep updating the patterns, words and rules to cope with the different things that people find offensive. A filter tuned for children's content would annoy adults.

One implementation that I have seen used a search engine and a number of complex stored queries that represented rude words. Documents that ranked highly were removed immediately. This approach could be adopted fairly easily since search engine software is readily available. The trick is to assemble the queries to run against the documents.
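To give a feel for the stored-query idea without a real search engine, here is a minimal Python sketch that stands in regular expressions for the stored queries and a weighted hit count for the ranking. This is my own simplification, not the implementation I saw; the patterns, weights, cut-off and function names are all assumptions.

```python
import re

# Hypothetical stored queries: (compiled pattern, assumed weight).
STORED_QUERIES = [
    (re.compile(r"\brude+word\b", re.IGNORECASE), 3.0),
    (re.compile(r"\bcrude\s*word\b", re.IGNORECASE), 1.0),
]
CUTOFF = 3.0  # assumed score at or above which a document is removed


def score(doc):
    """Sum the weighted number of hits for every stored query."""
    return sum(weight * len(pattern.findall(doc))
               for pattern, weight in STORED_QUERIES)


def rank(docs):
    """Return (score, document) pairs, highest score first."""
    scored = [(score(doc), doc) for doc in docs]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


def filter_docs(docs):
    """Split documents into (kept, removed) using the cut-off."""
    kept, removed = [], []
    for doc_score, doc in rank(docs):
        (removed if doc_score >= CUTOFF else kept).append(doc)
    return kept, removed
```

With a real search engine the queries would be written in its own query language and the ranking would come from the engine, but the shape of the process stays the same: run the stored queries, rank, and remove what scores above the cut-off.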
