coldfingertips has asked for the wisdom of the Perl Monks concerning the following question:

Trying to make a swearword filter. The swears table has a list of swear words, which won't be posted here for obvious reasons. I'm trying to get each word or phrase into an array for later processing.

my $data = qq(SELECT word FROM swears);
my $sth = $dbh->prepare($data);
$sth->execute() or die $dbh->errstr;
while (my $ref = $sth->fetchrow_arrayref) {
    push @swearwords, $ref->[0];
}
Then I try to pull text from the database and apply the swear word filter, but it's not filtering anything. I am pulling more data from the database, so here's the fetch loop for that:
while ($sth->fetch) { $text =~ s/$_/****/ foreach @swearwords;
No bad word in $text is being replaced.

I did do a test print of the @swearwords array to see what was in there and the words did get slurped up.

Any ideas why $text isn't being filtered for swearwords?

Replies are listed 'Best First'.
Re: array of MySQL data for substitution
by CountZero (Bishop) on Mar 29, 2005 at 20:11 UTC
    use strict;
    my @swearwords = qw/bad worse ugly/;
    my $text = 'This is a bad sentence with some ugly and worse words';
    $text =~ s/$_/****/ foreach @swearwords;
    print $text;
    works nicely for me, so the error is not in the regex.

    Perhaps you should do while (my $text=$sth->fetch) so you have at least some text to work on!

    Update: Or you could do away with the foreach loop as follows:

    use strict;
    my $swearwords = join '|', qw/bad worse ugly/;
    my $text = 'This is a bad sentence with some ugly and worse words';
    $text =~ s/$swearwords/****/g;
    print $text;

    CountZero

    "If you have four groups working on a compiler, you'll get a 4-pass compiler." - Conway's Law

Re: array of MySQL data for substitution
by cazz (Pilgrim) on Mar 29, 2005 at 20:02 UTC
    You are much better off doing all of the regexp at once. Try this:
    use Regexp::List;
    my $l = Regexp::List->new;
    my $re = $l->list2re(@swearwords);
    And then later on in your other loop, do this:
    $text =~ s/$re/****/g;
Re: array of MySQL data for substitution
by CountZero (Bishop) on Mar 29, 2005 at 20:37 UTC
    On a more general level of a "swearword filter", you will want to make your regex case-insensitive and avoid the substitution of "swearwords" which are part of "good" words.

    If "ass" is a swearword, you probably will not want "assistant" to become "****istant".
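A minimal sketch of those two points, assuming a made-up word list in place of the swears table (the `censor` sub is a hypothetical helper, not code from the thread). It joins the words into one alternation, escapes them with `quotemeta`, anchors the match with `\b`, and adds `/i`:

```perl
use strict;
use warnings;

# Hypothetical word list; in the thread it would come from the swears table.
my @swearwords = ('ass', 'darn it', 'darn');

# Longest entries first so a phrase wins over its own prefix,
# and quotemeta so punctuation or spaces in an entry match literally.
my $alternation = join '|',
    map  { quotemeta($_) }
    sort { length($b) <=> length($a) } @swearwords;

# \b anchors the match at word boundaries; /i ignores case.
my $re = qr/\b(?:$alternation)\b/i;

sub censor {
    my ($text) = @_;
    $text =~ s/$re/****/g;
    return $text;
}

print censor('My assistant said Darn it, what an ASS.'), "\n";
# prints: My assistant said ****, what an ****.
```

Word boundaries spare "assistant", but they do nothing for a word that is a swearword in one context and perfectly innocent in another.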

    And what will happen to a treatise about any of several hoofed mammals of the genus Equus, resembling and closely related to the horses but having a smaller build and longer ears, and including the domesticated donkey?

    Would you also want to apply censorship to the Holy Scripture?

    Of the domesticated species we read of,

    1. The she ass (Heb. 'athon), so named from its slowness (Gen. 12:16; 45:23; Num. 22:23; 1 Sam. 9:3).
    2. The male ass (Heb. hamor), the common working ass of Western Asia, so called from its red colour. Issachar is compared to a strong ass (Gen. 49:14). It was forbidden to yoke together an ass and an ox in the plough (Deut. 22:10).
    3. The ass's colt (Heb. 'air), mentioned Judg. 10:4; 12:14. It is rendered "foal" in Gen. 32:15; 49:11. (Comp. Job 11:12; Isa. 30:6.) The ass is an unclean animal, because it does not chew the cud (Lev. 11:26. Comp. 2 Kings 6:25). Asses constituted a considerable portion of wealth in ancient times (Gen. 12:16; 30:43; 1 Chr. 27:30; Job 1:3; 42:12). They were noted for their spirit and their attachment to their master (Isa. 1:3). They are frequently spoken of as having been ridden upon, as by Abraham (Gen. 22:3), Balaam (Num. 22:21), the disobedient prophet (1 Kings 13:23), the family of Abdon the judge, seventy in number (Judg. 12:14), Zipporah (Ex. 4:20), the Shunammite (1 Sam. 25:30), etc. Zechariah (9:9) predicted our Lord's triumphal entrance into Jerusalem, "riding upon an ass, and upon a colt," etc. (Matt. 21:5, R.V.).

    Of wild asses two species are noticed, (1) that called in Hebrew _'arod_, mentioned Job 39:5 and Dan. 5:21, noted for its swiftness; and (2) that called _pe're_, the wild ass of Asia (Job 39:6-8; 6:5; 11:12; Isa. 32:14; Jer. 2:24; 14:6, etc.). The wild ass was distinguished for its fleetness and its extreme shyness. In allusion to his mode of life, Ishmael is likened to a wild ass (Gen. 16:12. Here the word is simply rendered "wild" in the Authorized Version, but in the Revised Version, "wild-ass among men").

    Source: Easton's 1897 Bible Dictionary

    Just give it up, censorship never works.

    CountZero

      I would have voted this up for the advice about being careful with false positives ("ass" is the obvious example in this case), but it could have been done without the additional rant. There are many reasons one may want to filter text, and many people would appreciate having swear words blanked out. The answers given here could be useful for more than just censorship, as you call it.
        <PERL FREE CONTENT FOLLOWS>

        Well, let's just say "free speech" is one of my pets.

        Free speech is so fundamental to freedom as a whole that any -and I mean ANY- censorship is the beginning of the slippery slope to dictatorship.

        CountZero

      I disagree totally that censorship never works. Who are you to say this when you don't know the environment in which I am trying to censor? What if this was a chat system or guestbook for an elementary school web page? Would it be acceptable to allow the word ass?

      I think not. If it's a business site I don't see how the word "ass" would ever come into play. And what about asshole? This should be blocked because no matter how you argue it, this is a swear word.

      Same with the words fuck and shit; business sites or school sites shouldn't allow these. Are you upset with my fucking swearing? I only bring this up because someone wants this node deleted because you don't believe that swear words should be fucking censored.

      No, I'm not mad. I'm simply saying swear words are ACCEPTABLE to be censored but NOT acceptable to be said in most circumstances. And who are you to say what I can and cannot censor on my web site? Isn't that censoring what I'm trying to say or do?

      I asked a question and most posters were nice enough to post solutions to the problem. There's always that one rotten apple in the bunch.

        You can do on your website whatever you like. I don't mind.

        If I don't like your website or your attitude, I just don't visit it. I'm not at all disturbed by swearwords, but that is just me. Other people might get very upset by swearwords (for whatever definition of "swearwords"), and still other people might get even more upset by censorship of swearwords: to everyone their own, I'd say.

        But the only way your swearword filter is going to work is by including a dictionary of all known swearwords in all languages, and I'm afraid such a dictionary has not yet been compiled.

        And please, don't call me a rotten apple. If you were not so blinded by the swearword issue, you would have noticed I was one of the first to post a solution to your problem.

        CountZero

Re: array of MySQL data for substitution
by tlm (Prior) on Mar 29, 2005 at 20:26 UTC

    I don't know why you are getting no substitutions, but if I were doing this, instead of storing words in the DB, I'd store regexp strings (e.g. '\bfudge(?:d up)?\b') that I would then convert to regexps using qr; e.g.:

    while (my $ref = $sth->fetchrow_arrayref) {
        push @sw_regexps, qr($ref->[0]);
    }

    Also, you probably want the /g modifier in that s///
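As a rough sketch of how the stored-regexp idea could look end to end: the pattern strings below are hypothetical stand-ins for rows fetched from the database, `filter_text` is an illustrative helper name, and the `/i` flag is an added assumption rather than part of the suggestion above.

```perl
use strict;
use warnings;

# Hypothetical pattern strings, standing in for rows fetched from the database.
my @patterns = ('\bfudge(?:d up)?\b', '\bdarn\b');

# Compile each string once; /i is an added assumption.
my @sw_regexps = map { qr/$_/i } @patterns;

sub filter_text {
    my ($text) = @_;
    for my $re (@sw_regexps) {
        $text =~ s/$re/****/g;    # /g replaces every occurrence, not just the first
    }
    return $text;
}

print filter_text('Who fudged up the darn build? Fudge!'), "\n";
# prints: Who **** the **** build? ****!
```

Keeping the word boundaries in the stored patterns means each entry decides for itself how strict it is, at the cost of having to trust whatever is in that table as valid regexp syntax.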

    the lowliest monk

Re: array of MySQL data for substitution
by Cody Pendant (Prior) on Mar 29, 2005 at 23:13 UTC
    Leaving aside the interesting discussion of censorship (I wrote something rather like this for a kids' website myself -- it's different when you're dealing with kids and their parents... one complained to me that their kid had seen "the C-word" on our website. I was horrified and checked my code before realising they meant "crap"...) why has nobody said "Use A Module"?

    There are a bunch of solutions if you search CPAN, and they have all kinds of features like storing the rude words in ROT-13 so that you see "shpx" instead of "fuck", plus using soundalike detection so that more creative miscreants can't get away with "phuk" either.
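The ROT-13 part is easy to picture without any module. A small sketch, assuming a hypothetical stored list (the `rot13` sub and the two sample entries are illustrative, not taken from any particular CPAN distribution):

```perl
use strict;
use warnings;

# ROT-13 is its own inverse, so one sub both encodes and decodes.
sub rot13 {
    my ($s) = @_;
    $s =~ tr/A-Za-z/N-ZA-Mn-za-m/;
    return $s;
}

# Hypothetical stored list: "penc" and "qnea" decode to "crap" and "darn".
my @stored     = qw(penc qnea);
my @swearwords = map { rot13($_) } @stored;

print join(', ', @swearwords), "\n";
# prints: crap, darn
```

The list is decoded once at startup, so anyone browsing the source or the table sees only the scrambled form.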



    ($_='kkvvttuubbooppuuiiffssqqffssmmiibbddllffss')
    =~y~b-v~a-z~s; print
Re: array of MySQL data for substitution
by Errto (Vicar) on Mar 30, 2005 at 05:13 UTC
    In the second snippet, you're running a substitution on the variable $text without assigning it the result of your query first. Is that what the actual code does?
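To illustrate the point: with DBI, a bare `while ($sth->fetch)` only fills variables that were bound beforehand (for example with `bind_columns`); otherwise the row has to be captured and the column copied into $text. A sketch under the assumption that the text is the first column, using a plain array of rows as a stand-in for the statement handle so it runs without a database:

```perl
use strict;
use warnings;

# Stand-in for the rows a statement handle would return; with DBI,
# each call to $sth->fetchrow_arrayref yields one such array reference.
my @rows       = ( ['what a bad day'], ['nothing to see here'] );
my @swearwords = qw(bad worse ugly);

my @cleaned;
while (my $row = shift @rows) {
    my $text = $row->[0];    # assign the column value before filtering
    $text =~ s/\Q$_\E/****/g foreach @swearwords;
    push @cleaned, $text;
}

print "$_\n" for @cleaned;
# prints:
# what a **** day
# nothing to see here
```

If $text is never assigned inside the loop, the substitution runs on whatever $text held before, which would match the symptom described in the question.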