Substitute 'bad words' with 'good words' according to lists

mulander has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
Re: Substitute 'bad words' with 'good words' according to lists by jhourcle (Prior) on Sep 25, 2005 at 18:35 UTC
You're on the right track, but you have a few issues to deal with -- Order matters -- some terms to be replaced are found within other terms to be replaced, so you may get different effects depending of what order the keys of %words are returned in. Intra-word matching -- this may be necessary (people making inappropriate compound words, but if you're trying to remove 'ass', you want to match 'kickass' and not 'assume'. It may be necessary to match on word boundries, if you want to only get terms on their own (and even that won't be perfect). Chaining effects -- because you're going through for each term, depending on your situation, if there's a possibility of a replacement resulting in a match for another term, you'll again be dependant on the order items are returned from %words. What's the best solution? I have no idea. More efficient, I might be able to do, but what's the acceptable tradeoff between ease of adding new terms and other maintenance time, execution time, missed terms, incorrectly replaced terms, or whatever other parameters you might have. I'd have done something like the following, if we were matching on whole words, and we didn't have the other issues I mentioned: `my $regex_string = "\b('. join ('\|',keys %words).')\b'; my $regex = qr/$regex_string/; $txt =~ s/$regex/$words{$regex}/eg;` [download] I think there's a module in CPAN that builds better regexes from lists, but I can't remember what it was called. (and depending on how often you rebuild the list, and the number of terms, available memory, etc, this might not be the best way for you)	[reply] [d/l]
Re^2: Substitute 'bad words' with 'good words' according to lists by mulander (Monk) on Sep 25, 2005 at 18:52 UTC
Thank you both for the time it took you to write these responses. pg I find your method interesting and agree that it is much more efficient than the one I came up with. Thanks for pointing me out a big mistake ( the number of iterations ). jhourcle I must agree with the things you mentioned as additional problems that must be taken care of. This node was created not to accomplish some script but to seek a simillar method to the php one. The code you added along with your post is 'exactly' the anwser I was seeking, as it shows that it can be done almost exactly as in php. I tried a simillar solution by joining a list of bad words and a list of good words, but that was obviusly wrong as I did not know how to replace the word matched by the regex with the correspoding good word, you showed me that it can be done with a hash, and now I see how blind I was before. Thank you again, and thank you both for the time it took you to read this node and share your ideas.	[reply]
Re: Substitute 'bad words' with 'good words' according to lists by pg (Canon) on Sep 25, 2005 at 18:21 UTC
It is not the most efficient way, as you are going through the entire sentence for each key work. Say you have n key word, and the sentence contains m words, you are checking nm times. One of the better way is to split the sentence into words, go through the list, and see whether each word exists in the hash (hash search is nothing). This only requires to go through the sentence three times: 1) once to split. 2) once to replace, and 3) if you wish to count this one, to join it back. `use strict; use warnings; my %words = ( ugly => 'ug', anotherugly => 'anot******', ); my $txt = "ugly anotherugly"; my @words = split / /, $txt; # largely simplified, you have to count , +.:; etc for my $i (0 .. $#words) { $words[$i] = $words{$words[$i]} if (exists($words{$words[$i]})) } print join(' ', @words);` [download]	[reply] [d/l]
Re^2: Substitute 'bad words' with 'good words' according to lists by sk (Curate) on Sep 26, 2005 at 03:00 UTC
pg, I think the original code is not necessarily inefficient.I feel the performance depends on number of words your split returns.. Here is a benchmark of the original (added `keys` which was missing). I have modifed the txt to be 100 times the original one. Again the story could be different when you have way too many replacements and fewer words. #!/usr/bin/perl use strict; use warnings; use Benchmark qw (:all); my $txt = "ugly anotherugly " x 100; # print $txt,$/; sub pg { my %words = ( ugly => 'ug', anotherugly => 'anot***', ); my @words = split / /, $txt; # largely simplified, you have to cou +nt ,.:; etc for my $i (0 .. $#words) { $words[$i] = $words{$words[$i]} if (exists($words{$words[$ +i]})) } # print join(' ', @words),$/; } sub orig { my %words = ( ugly => 'ug', anotherugly => 'anot*****', ); $txt =~ s/$_/$words{$_}/g foreach keys(%words); # print $txt,$/; } my $test = {'pg' => \&pg, 'Original' =>\&orig,}; my $result = timethese(-10,$test ); cmpthese($result); [download] Output** `Benchmark: running Original, pg for at least 10 CPU seconds... Original: 11 wallclock secs (10.86 usr + 0.00 sys = 10.86 CPU) @ 43 +770.26/s (n=475345) pg: 11 wallclock secs (10.68 usr + 0.00 sys = 10.68 CPU) @ 43 +28.46/s (n=46228) Rate pg Original pg 4328/s -- -90% Original 43770/s 911% --` [download] NOTE: I removed the `join` from your code just to show the looping differences.	[reply] [d/l] [select]
Re^3: Substitute 'bad words' with 'good words' according to lists by pg (Canon) on Sep 26, 2005 at 05:14 UTC
You are right, and thanks for pointing out. My original analysis took the assumption that both s/// and split iterate through the sentence with the same performance, however that was wrong, and split() is much slower: `use strict; use warnings; use Benchmark qw (:all); my $txt = "a" x 100; sub seperate { split //, $txt; } sub replace { $txt =~ s/a/b/g; } my $result = timethese(100000, {'seperate' => \&seperate, 'replace' => + \&replace});` [download] This gives: Benchmark: timing 10000 iterations of replace, seperate... replace: 0 wallclock secs ( 0.00 usr + 0.00 sys = 0.00 CPU) (warning: too few iterations for a reliable count) seperate: 2 wallclock secs ( 1.20 usr + 0.00 sys = 1.20 CPU) @ 83 +05.65/s (n =10000) C:\Perl\bin>perl -w math1.pl Benchmark: timing 100000 iterations of replace, seperate... replace: 1 wallclock secs ( 0.03 usr + 0.00 sys = 0.03 CPU) @ 31 +25000.00/s (n=100000) (warning: too few iterations for a reliable count) seperate: 16 wallclock secs (12.50 usr + 0.00 sys = 12.50 CPU) @ 80 +00.00/s (n =100000) [download]	[reply] [d/l] [select]
Re: Substitute 'bad words' with 'good words' according to lists by derby (Abbot) on Sep 25, 2005 at 22:26 UTC
Acme::Flanders -derby	[reply]
Re: Substitute 'bad words' with 'good words' according to lists by monarch (Priest) on Sep 25, 2005 at 23:33 UTC
This just gave me a brilliant idea; why don't corporate firewalls just ROT13 any incoming emails with "detected bad language" rather than just rejecting the email all together..	[reply]
Re: Substitute 'bad words' with 'good words' according to lists by TedPride (Priest) on Sep 26, 2005 at 00:53 UTC
Because the object of the firewall is spam filtering more than anything else, and if you let the email through, you're not reducing the amount of spam. Personally, I prefer using Yahoo for my email. Yahoo has the advantage of access to a huge sampling of emails and user input, so it can much better decide whether an email (or host) is spam or not than the average corporate firewall.	[reply]
OT: Re^2: Substitute 'bad words' with 'good words' according to lists by dragonchild (Archbishop) on Sep 26, 2005 at 01:47 UTC
If that's so, then why does my GMail account correctly filter 99% of my 30+ spams/day while Yahoo fails to filter more than 80% of the 50+ spams/day that account receives? Hmm ... My criteria for good software: Does it work? Can someone else come in, make a change, and be reasonably certain no bugs were introduced?	[reply]
Re^2: Substitute 'bad words' with 'good words' according to lists by spiritway (Vicar) on Sep 26, 2005 at 01:51 UTC
Unfortunately, in my experience, Yahoo! generates far too many false positives. I wind up having to go through their bulk mail folder to find messages incorrectly identified as spam.	[reply]
Re: Substitute 'bad words' with 'good words' according to lists by mulander (Monk) on Sep 26, 2005 at 05:35 UTC
sk thanks for the effort of Benchmarking these codes. We can of course assume that the wordlist will be medium sized, but the txt can get unlimited length ( of course it won't be created of bad words only ). But I really wonder how efficient all of these methods (those that appeared in this node ) are when compared to the php code I supplied in my first writeup. I think that php has this method built-in in the core of the language, so it could be hard to beat ( if they did not mess it up ;P ). I suppose there is not a realible way to benchmark php code vs perl code ? Or maybe am I wrong and it can be done? Another thing, does any body know how str_replace(); exactly does it's magic while replacing those lists? How many iteration does it do? Or maybe it is done in one go?	[reply]


Perl: the Markov chain saw
	PerlMonks