Re: Match first word in line fails
by mr_mischief (Monsignor) on Jan 30, 2005 at 07:53 UTC
|
Maybe I'm just daft tonight, as may be the case at after 1 AM my time. Perl has powerful features in its regex engine. I suggest using them and being done with it. Is there some reason I'm missing that the following doesn't work?
#!/usr/bin/perl -w
use strict;
my %bad_words = (
'crud' => 'dirt',
'crap' => 'dung',
'poo' => 'dung',
'wanker' => 'fool'
);
my $bad_re = join '|', keys %bad_words;
while ( <DATA> ) {
s/\b($bad_re)\b/$bad_words{lc($1)}/ieg;
print;
}
__DATA__
"Wanker!" said I.
"Crap," says Travis.
This is horse poo. Poo, I tell you!
"Poo upon all the crud and crap the wanker could see."
When I tested this, it did what I think is being sought. Other than the building of the wordlist hash and the displaying (or writing to file, or assigning to a variable) of the changed text, it's just one line outside the loop to get ready for the regex, and one inside the loop to do the work. Here's my output:
"fool!" said I.
"dung," says Travis.
This is horse dung. dung, I tell you!
"dung upon all the dirt and dung the fool could see."
There's no need, as far as I can tell, for anything besides search and replace when what's wanted is search and replace.
Update: Looking back at this, the /e on the s/// isn't necessary.
| [reply] [d/l] [select] |
|
|
I would make it slightly more complicated, by calling replaceword("$1") in the substitution:
sub replaceword
{
my $w = $_[0];
my $repl = $bad_words{lc($w)};
if (lc($w) eq $w) {
lc($repl);
} elsif (ucfirst($w) eq $w) {
ucfirst($repl);
} else {
uc($repl);
}
}
| [reply] [d/l] |
|
|
If getting the right case is important, some extra steps may be needed. I think this version of replaceword() would be sufficient, as either lowercase or ucfirst() would be the normal casing. All-uppercase as a default disallows strange mixed case renderings, so while it would preserve some effect, it's still not a complete solution.
sub replaceword {
my $w = $_[0];
my $repl = $bad_words{ lc( $w ) };
$repl = ucfirst( $repl ) if ( $w =~ /^[A-Z]/ );
return $repl;
}
Of course, this would require that the eval flag stay on the substitution in my previous post, contrary to the update which applies to the code as appears in that node.
| [reply] [d/l] |
|
|
Re: Match first word in line fails
by holli (Abbot) on Jan 30, 2005 at 04:57 UTC
|
IŽd rewrite your code as follows:
#!/usr/bin/perl
use strict;
use warnings;
my $badwords = join("|", @badwords);
while (<$book>)
{
s/[\"'.,!?:;()[\]{}|\\\/-]/ /g; #replace all punctuation with a sp
+ace
s/\b($badwords)\b/replaceword($1)/eig; #replace all bad words, ign
+ore case
}
Update:
It will be much easier for us to tell, if you give us more sample data.
| [reply] [d/l] |
|
|
Yes, but he did not want to remove punctuation from the results. So I guess the first line in the loop shoud be dropped too.
| [reply] |
|
|
I guess that's better, it finds more words than before, but it still isn't finding all of the words. But alas, I'm too tired to work on it more at this hour. I'll hit it more tomorrow and let you guys know.
| [reply] |
Re: Match first word in line fails
by CountZero (Bishop) on Jan 30, 2005 at 14:51 UTC
|
There is nothing wrong with your code. You are extracting all words and matching all badwords. I slightly adapted your code and added some print-statements to see what happens:
use strict;
use warnings;
my @badwords = qw/crud crap poo wanker/;
while (<DATA>) {
my $line = $_;
my $plainline = $line;
$plainline =~ s/["'.,!?:;\-()[\]{}|\\\/]/ /g; #replace all punctua
+tion with a space
my @sentence = split(/ /,$plainline);
print join '|', @sentence,"\n";
foreach my $word (@sentence) {
chomp($word);
print "\tChecking *$word*\n";
my $whichword = 0; # to track which badword was found
foreach my $badword (@badwords) {
if (lc($word) eq $badword) {
print "\t\tFound $badword\n";
#my $newword = replaceword($whichword); #get a cleaner
+ word to replace the naughty word
#$line =~ s/($word)/$newword/i;
}
$whichword++;
}
}
#$cleanbook .= $line;
}
__DATA__
"Wanker!" said I.
"Crap," says Travis.
This is horse poo. Poo, I tell you!
"Poo upon all the crud and crap the wanker could see."
And the output of this is: |Wanker|||said|I|
|
Checking **
Checking *Wanker*
Found wanker
Checking **
Checking **
Checking *said*
Checking *I*
Checking **
|Crap|||says|Travis|
|
Checking **
Checking *Crap*
Found crap
Checking **
Checking **
Checking *says*
Checking *Travis*
Checking **
This|is|horse|poo||Poo||I|tell|you|
|
Checking *This*
Checking *is*
Checking *horse*
Checking *poo*
Found poo
Checking **
Checking *Poo*
Found poo
Checking **
Checking *I*
Checking *tell*
Checking *you*
Checking **
|Poo|upon|all|the|crud|and|crap|the|wanker|could|see||
|
Checking **
Checking *Poo*
Found poo
Checking *upon*
Checking *all*
Checking *the*
Checking *crud*
Found crud
Checking *and*
Checking *crap*
Found crap
Checking *the*
Checking *wanker*
Found wanker
Checking *could*
Checking *see*
Checking **
Checking **
Maybe the error is in your replaceword sub?As you see, you have lots of empty items in your wordslist since you replace punctuation by spaces and then split on spaces, effectively dropping all spaces but making a lot of empty "words".
CountZero "If you have four groups working on a compiler, you'll get a 4-pass compiler." - Conway's Law
| [reply] [d/l] [select] |
|
|
| [reply] |
Re: Match first word in line fails
by Popcorn Dave (Abbot) on Jan 30, 2005 at 07:02 UTC
|
Here's another way to approach the problem. I'm not sure it's a better way than holli layed out for you but if you've got a bunch of words to replace I'd think a hash might be the way to go. Unfortunately this doesn't save your puncuation, and tonight I'm at a loss on how to split and save puncuation.
With that said, here's my attempt:
#!/usr/bin/perl
use strict;
use warnings;
my ($line, $cleanbook);
my %badwords = (crud => "darn", shit => "shoot");
while ($line = <DATA>) {
$line =~ s/["'.,!?:;\-()[\]{}|\\\/]/ /g; #replace all punctuation
+with a space
my @sentence = split(/\W+/,$line);
@sentence = map {$badwords{lc($_)}?$badwords{lc($_)}:$_} @sentence
+;
$cleanbook .= join(' ', @sentence)."\n";
}
print $cleanbook;
__DATA__
"Crud," said Travis
"Shit, this crud is shit!"
Useless trivia: In the 2004 Las Vegas phone book there are approximately 28 pages of ads for massage, but almost 200 for lawyers.
| [reply] [d/l] |
|
|
Change this:
my @sentence = split(/\W+/,$line);
to:
my @sentence = split(/(\W+)/,$line);
Now @sentence will contain the punctuation as well.
| [reply] [d/l] [select] |
Re: Match first word in line fails
by Anonymous Monk on Jan 31, 2005 at 00:33 UTC
|
use Regexp::Common qw(profanity);
my %bad2good = (...);
while(<>) {
s/($RE{profanity})/$bad2good{lc $1}/gi;
print;
}
You'll have to come up with your own bad word replacement list. If the regexp doesn't work for you as is, you can either edit the module as you like, or submit a bug report. | [reply] [d/l] |
Re: Match first word in line fails
by frostman (Beadle) on Jan 31, 2005 at 01:34 UTC
|
Well yacoubean, I wish I could be more help but it's 2 in the morning and all I could think of is the following, which is food for thought at best.
#!/usr/bin/perl
# hmm, try bleeping...
use strict;
my @badwords = qw(crud poop darn);
my $cleanbook;
# i wonder if it will work...
my $match = join("|",@badwords); # lazy, not checking badwords.
while (<DATA>) {
# ahh, inefficiency!
while ($_ =~ /$match/i) {
my $bad = $&;
(my $bloop = $bad) =~ s/(?<=\w)\w/\*/sg;
$_ =~ s/$bad/$bloop/sg;
}
$cleanbook .= $_;
}
print $cleanbook;
__DATA__
"Crud," said Travis. But then, Travis was
always something of a poopy-head. He could
never keep to his darned self. Darn it Travis!
This gives me:
[frost@louddrunk]$ perl t.pl
"C***," said Travis. But then, Travis was
always something of a p***y-head. He could
never keep to his d***ed self. D*** it Travis!
...which may or may not be useful in your case. I was just thinking
that if I were in your shoes I'd probably rather bl**p the profanities
than replace them with gentler versions.
Erasing them requires too much skill with Grammar::Parse,
and replacing them with other words is, well, putting words in
people's mouths.
On the other hand, the profane among us are quite used to the occasional
obfuscation...
| [reply] [d/l] [select] |
|
|
(my $bloop = $bad) =~ s/(?<=\w{2})\w/\*/sg;
...which I think makes it more legible:
"Cr**," said Travis. But then, Travis was
always something of a po**y-head. He could
never keep to his da**ed self. Da** it Travis!
| [reply] [d/l] [select] |
Re: Match first word in line fails
by yacoubean (Scribe) on Jan 31, 2005 at 16:52 UTC
|
Wow! I didn't have time to mess with this problem yesterday, but I see that you guys went crazy and gave me some excellent tips! I'm back at work now, so I'll work on it more tonight, then let you know who won. ;)
| [reply] |
|
|
Again, thanks a bunch guys. I ended up using mr_mischief's and ysth's suggestions. The other thing I didn't mention here is that I want to keep stats on the potty language that's replaced, so I added that in and now it all works the way I wanted! :)
| [reply] |