yacoubean has asked for the wisdom of the Perl Monks concerning the following question:

Hi monks,

Excuse my frailties with perl, as I'm still learning.

I am writing a little script to remove bad words from a document, using a bad words list. Everything is working fine except for lines that have a bad word as the first word in the sentence. Also, it appears the problem may be that these sentences start with a double qoute. For example:

If I want to remove the word 'crud', this sentence doesn't match:

"Crud!", Travis said.

My code is stripping all punctuation (at least it appears to be), and I am using 'lc' on the input words to make sure my bad words match (all are lower case).

After I've removed all punctuation from the above sentence, it looks like this:

Crud Travix said

I can't figure out why the word 'crud' isn't matching in this case, as it is in my bad words list with no typos or anything dumb like that. All other words appear to be matching as expected, and 'crud' matches fine when it appears in other locations in the line. Here's some code:

#!/usr/bin/perl use strict; use warnings; ... while (<$book>) { my $line = $_; my $plainline = $line; $plainline =~ s/["'.,!?:;\-()[\]{}|\\\/]/ /g; #replace all punctua +tion with a space my @sentence = split(/ /,$plainline); foreach my $word (@sentence) { chomp($word); my $whichword = 0; # to track which badword was found foreach my $badword (@badwords) { if (lc($word) eq $badword) { my $newword = replaceword($whichword); #get a cleaner +word to replace the naughty word $line =~ s/($word)/$newword/i; } $whichword++; } } $cleanbook .= $line; }

If you have any suggestions that don't relate to my question, feel free to drop them in. TIA.

----------
My home on the web: http://www.techfeed.net/

Replies are listed 'Best First'.
Re: Match first word in line fails
by mr_mischief (Monsignor) on Jan 30, 2005 at 07:53 UTC
    Maybe I'm just daft tonight, as may be the case at after 1 AM my time. Perl has powerful features in its regex engine. I suggest using them and being done with it. Is there some reason I'm missing that the following doesn't work?

    #!/usr/bin/perl -w use strict; my %bad_words = ( 'crud' => 'dirt', 'crap' => 'dung', 'poo' => 'dung', 'wanker' => 'fool' ); my $bad_re = join '|', keys %bad_words; while ( <DATA> ) { s/\b($bad_re)\b/$bad_words{lc($1)}/ieg; print; } __DATA__ "Wanker!" said I. "Crap," says Travis. This is horse poo. Poo, I tell you! "Poo upon all the crud and crap the wanker could see."

    When I tested this, it did what I think is being sought. Other than the building of the wordlist hash and the displaying (or writing to file, or assigning to a variable) of the changed text, it's just one line outside the loop to get ready for the regex, and one inside the loop to do the work. Here's my output:

    "fool!" said I. "dung," says Travis. This is horse dung. dung, I tell you! "dung upon all the dirt and dung the fool could see."

    There's no need, as far as I can tell, for anything besides search and replace when what's wanted is search and replace.

    Update: Looking back at this, the /e on the s/// isn't necessary.



    Christopher E. Stith
      I would make it slightly more complicated, by calling replaceword("$1") in the substitution:
      sub replaceword { my $w = $_[0]; my $repl = $bad_words{lc($w)}; if (lc($w) eq $w) { lc($repl); } elsif (ucfirst($w) eq $w) { ucfirst($repl); } else { uc($repl); } }
        If getting the right case is important, some extra steps may be needed. I think this version of replaceword() would be sufficient, as either lowercase or ucfirst() would be the normal casing. All-uppercase as a default disallows strange mixed case renderings, so while it would preserve some effect, it's still not a complete solution.

        sub replaceword { my $w = $_[0]; my $repl = $bad_words{ lc( $w ) }; $repl = ucfirst( $repl ) if ( $w =~ /^[A-Z]/ ); return $repl; }

        Of course, this would require that the eval flag stay on the substitution in my previous post, contrary to the update which applies to the code as appears in that node.



        Christopher E. Stith
Re: Match first word in line fails
by holli (Abbot) on Jan 30, 2005 at 04:57 UTC
    IŽd rewrite your code as follows:
    #!/usr/bin/perl use strict; use warnings; my $badwords = join("|", @badwords); while (<$book>) { s/[\"'.,!?:;()[\]{}|\\\/-]/ /g; #replace all punctuation with a sp +ace s/\b($badwords)\b/replaceword($1)/eig; #replace all bad words, ign +ore case }
    Update:
    It will be much easier for us to tell, if you give us more sample data.

    holli, regexed monk
      Yes, but he did not want to remove punctuation from the results. So I guess the first line in the loop shoud be dropped too.
      I guess that's better, it finds more words than before, but it still isn't finding all of the words. But alas, I'm too tired to work on it more at this hour. I'll hit it more tomorrow and let you guys know.
      --------------
      My home on the web: http://www.techfeed.net/
Re: Match first word in line fails
by CountZero (Bishop) on Jan 30, 2005 at 14:51 UTC
    There is nothing wrong with your code. You are extracting all words and matching all badwords. I slightly adapted your code and added some print-statements to see what happens:
    use strict; use warnings; my @badwords = qw/crud crap poo wanker/; while (<DATA>) { my $line = $_; my $plainline = $line; $plainline =~ s/["'.,!?:;\-()[\]{}|\\\/]/ /g; #replace all punctua +tion with a space my @sentence = split(/ /,$plainline); print join '|', @sentence,"\n"; foreach my $word (@sentence) { chomp($word); print "\tChecking *$word*\n"; my $whichword = 0; # to track which badword was found foreach my $badword (@badwords) { if (lc($word) eq $badword) { print "\t\tFound $badword\n"; #my $newword = replaceword($whichword); #get a cleaner + word to replace the naughty word #$line =~ s/($word)/$newword/i; } $whichword++; } } #$cleanbook .= $line; } __DATA__ "Wanker!" said I. "Crap," says Travis. This is horse poo. Poo, I tell you! "Poo upon all the crud and crap the wanker could see."

    And the output of this is:

    |Wanker|||said|I| | Checking ** Checking *Wanker* Found wanker Checking ** Checking ** Checking *said* Checking *I* Checking ** |Crap|||says|Travis| | Checking ** Checking *Crap* Found crap Checking ** Checking ** Checking *says* Checking *Travis* Checking ** This|is|horse|poo||Poo||I|tell|you| | Checking *This* Checking *is* Checking *horse* Checking *poo* Found poo Checking ** Checking *Poo* Found poo Checking ** Checking *I* Checking *tell* Checking *you* Checking ** |Poo|upon|all|the|crud|and|crap|the|wanker|could|see|| | Checking ** Checking *Poo* Found poo Checking *upon* Checking *all* Checking *the* Checking *crud* Found crud Checking *and* Checking *crap* Found crap Checking *the* Checking *wanker* Found wanker Checking *could* Checking *see* Checking ** Checking **
    Maybe the error is in your replaceword sub?

    As you see, you have lots of empty items in your wordslist since you replace punctuation by spaces and then split on spaces, effectively dropping all spaces but making a lot of empty "words".

    CountZero

    "If you have four groups working on a compiler, you'll get a 4-pass compiler." - Conway's Law

Re: Match first word in line fails
by Popcorn Dave (Abbot) on Jan 30, 2005 at 07:02 UTC
    Here's another way to approach the problem. I'm not sure it's a better way than holli layed out for you but if you've got a bunch of words to replace I'd think a hash might be the way to go. Unfortunately this doesn't save your puncuation, and tonight I'm at a loss on how to split and save puncuation.

    With that said, here's my attempt:

    #!/usr/bin/perl use strict; use warnings; my ($line, $cleanbook); my %badwords = (crud => "darn", shit => "shoot"); while ($line = <DATA>) { $line =~ s/["'.,!?:;\-()[\]{}|\\\/]/ /g; #replace all punctuation +with a space my @sentence = split(/\W+/,$line); @sentence = map {$badwords{lc($_)}?$badwords{lc($_)}:$_} @sentence +; $cleanbook .= join(' ', @sentence)."\n"; } print $cleanbook; __DATA__ "Crud," said Travis "Shit, this crud is shit!"
    Useless trivia: In the 2004 Las Vegas phone book there are approximately 28 pages of ads for massage, but almost 200 for lawyers.

      Change this:

      my @sentence = split(/\W+/,$line);

      to:

      my @sentence = split(/(\W+)/,$line);

      Now @sentence will contain the punctuation as well.

Re: Match first word in line fails
by Anonymous Monk on Jan 31, 2005 at 00:33 UTC
    This might help:

    use Regexp::Common qw(profanity); my %bad2good = (...); while(<>) { s/($RE{profanity})/$bad2good{lc $1}/gi; print; }

    You'll have to come up with your own bad word replacement list. If the regexp doesn't work for you as is, you can either edit the module as you like, or submit a bug report.

Re: Match first word in line fails
by frostman (Beadle) on Jan 31, 2005 at 01:34 UTC

    Well yacoubean, I wish I could be more help but it's 2 in the morning and all I could think of is the following, which is food for thought at best.

    #!/usr/bin/perl # hmm, try bleeping... use strict; my @badwords = qw(crud poop darn); my $cleanbook; # i wonder if it will work... my $match = join("|",@badwords); # lazy, not checking badwords. while (<DATA>) { # ahh, inefficiency! while ($_ =~ /$match/i) { my $bad = $&; (my $bloop = $bad) =~ s/(?<=\w)\w/\*/sg; $_ =~ s/$bad/$bloop/sg; } $cleanbook .= $_; } print $cleanbook; __DATA__ "Crud," said Travis. But then, Travis was always something of a poopy-head. He could never keep to his darned self. Darn it Travis!

    This gives me:

    [frost@louddrunk]$ perl t.pl "C***," said Travis. But then, Travis was always something of a p***y-head. He could never keep to his d***ed self. D*** it Travis!

    ...which may or may not be useful in your case. I was just thinking that if I were in your shoes I'd probably rather bl**p the profanities than replace them with gentler versions.

    Erasing them requires too much skill with Grammar::Parse, and replacing them with other words is, well, putting words in people's mouths.

    On the other hand, the profane among us are quite used to the occasional obfuscation...

      Oh, and of course you can make it a little less bleepy like so:

      (my $bloop = $bad) =~ s/(?<=\w{2})\w/\*/sg;

      ...which I think makes it more legible:

      "Cr**," said Travis. But then, Travis was always something of a po**y-head. He could never keep to his da**ed self. Da** it Travis!
Re: Match first word in line fails
by yacoubean (Scribe) on Jan 31, 2005 at 16:52 UTC
    Wow! I didn't have time to mess with this problem yesterday, but I see that you guys went crazy and gave me some excellent tips! I'm back at work now, so I'll work on it more tonight, then let you know who won. ;)
    --------------
    My home on the web: http://www.techfeed.net/
      Again, thanks a bunch guys. I ended up using mr_mischief's and ysth's suggestions. The other thing I didn't mention here is that I want to keep stats on the potty language that's replaced, so I added that in and now it all works the way I wanted! :)
      --------------
      My home on the web: http://www.techfeed.net/