Beefy Boxes and Bandwidth Generously Provided by pair Networks
Perl: the Markov chain saw
 
PerlMonks  

Substitute 'bad words' with 'good words' according to lists

by mulander (Monk)
on Sep 25, 2005 at 18:08 UTC ( [id://494940]=perlquestion: print w/replies, xml ) Need Help??

mulander has asked for the wisdom of the Perl Monks concerning the following question:

wHellow fellow monks.
A friend asked me recently if it is possible in Perl to simulate this code from php:
$bad_words = array('ugly', 'anotherugly'); $good_words = array('ug**','anot******y'); $txt = 'ugly anotherugly'; $txt str_replace($bad_words, $good_words, $txt);
In this code he first creates a list of bad_words that he want's to search for, then creates a list of good_words to subsitute instead of the found bad ones. He gives the str_replace the bad and good words only telling him on what string it should work on.
I came up with this solution:
#!/usr/bin/perl use warnings; use strict; my %words = ( ugly => 'ug**', anotherugly => 'anot*******', ); my $txt = "ugly anotherugly"; $txt =~ s/$_/$words{$_}/g foreach %words; print $txt;
It works fine, but I am wondering about a few things, that are beyond my kwnoledge.
1. Is it the most efficient way?
2. How is it done by php's str_replace, does it iterate over each bad word as my foreach loop iterates over %words?
3. Is it at least that fast as the use of php's str_replace?
4. Is there a better way to do it?

I am almost sure that it can be done much better ( more efficient ) in Perl than I did. I tried to search the perldoc but with no luck. Maybe some of you fellow Monks had expirience with such construct's ( maybe those that work with CGI ? ) and have a relatively good equivalent for the php code.
I await your replies, and hope to see some interesting solutions, or just simple ideas how to solve this problem?

Replies are listed 'Best First'.
Re: Substitute 'bad words' with 'good words' according to lists
by jhourcle (Prior) on Sep 25, 2005 at 18:35 UTC

    You're on the right track, but you have a few issues to deal with --

    1. Order matters -- some terms to be replaced are found within other terms to be replaced, so you may get different effects depending of what order the keys of %words are returned in.
    2. Intra-word matching -- this may be necessary (people making inappropriate compound words, but if you're trying to remove 'ass', you want to match 'kickass' and not 'assume'. It may be necessary to match on word boundries, if you want to only get terms on their own (and even that won't be perfect).
    3. Chaining effects -- because you're going through for each term, depending on your situation, if there's a possibility of a replacement resulting in a match for another term, you'll again be dependant on the order items are returned from %words.

    What's the best solution? I have no idea. More efficient, I might be able to do, but what's the acceptable tradeoff between ease of adding new terms and other maintenance time, execution time, missed terms, incorrectly replaced terms, or whatever other parameters you might have.

    I'd have done something like the following, if we were matching on whole words, and we didn't have the other issues I mentioned:

    my $regex_string = "\b('. join ('|',keys %words).')\b'; my $regex = qr/$regex_string/; $txt =~ s/$regex/$words{$regex}/eg;

    I think there's a module in CPAN that builds better regexes from lists, but I can't remember what it was called. (and depending on how often you rebuild the list, and the number of terms, available memory, etc, this might not be the best way for you)

      Thank you both for the time it took you to write these responses.

      pg I find your method interesting and agree that it is much more efficient than the one I came up with. Thanks for pointing me out a big mistake ( the number of iterations ).

      jhourcle I must agree with the things you mentioned as additional problems that must be taken care of. This node was created not to accomplish some script but to seek a simillar method to the php one. The code you added along with your post is 'exactly' the anwser I was seeking, as it shows that it can be done almost exactly as in php. I tried a simillar solution by joining a list of bad words and a list of good words, but that was obviusly wrong as I did not know how to replace the word matched by the regex with the correspoding good word, you showed me that it can be done with a hash, and now I see how blind I was before. Thank you again, and thank you both for the time it took you to read this node and share your ideas.
Re: Substitute 'bad words' with 'good words' according to lists
by pg (Canon) on Sep 25, 2005 at 18:21 UTC

    It is not the most efficient way, as you are going through the entire sentence for each key work. Say you have n key word, and the sentence contains m words, you are checking n*m times.

    One of the better way is to split the sentence into words, go through the list, and see whether each word exists in the hash (hash search is nothing). This only requires to go through the sentence three times: 1) once to split. 2) once to replace, and 3) if you wish to count this one, to join it back.

    use strict; use warnings; my %words = ( ugly => 'ug**', anotherugly => 'anot*******', ); my $txt = "ugly anotherugly"; my @words = split / /, $txt; # largely simplified, you have to count , +.:; etc for my $i (0 .. $#words) { $words[$i] = $words{$words[$i]} if (exists($words{$words[$i]})) } print join(' ', @words);
      pg,

      I think the original code is not necessarily inefficient.I feel the performance depends on number of words your split returns.. Here is a benchmark of the original (added  keys which was missing). I have modifed the txt to be 100 times the original one.

      Again the story could be different when you have way too many replacements and fewer words.

      #!/usr/bin/perl use strict; use warnings; use Benchmark qw (:all); my $txt = "ugly anotherugly " x 100; # print $txt,$/; sub pg { my %words = ( ugly => 'ug**', anotherugly => 'anot*******', ); my @words = split / /, $txt; # largely simplified, you have to cou +nt ,.:; etc for my $i (0 .. $#words) { $words[$i] = $words{$words[$i]} if (exists($words{$words[$ +i]})) } # print join(' ', @words),$/; } sub orig { my %words = ( ugly => 'ug**', anotherugly => 'anot*******', ); $txt =~ s/$_/$words{$_}/g foreach keys(%words); # print $txt,$/; } my $test = {'pg' => \&pg, 'Original' =>\&orig,}; my $result = timethese(-10,$test ); cmpthese($result);

      Output

      Benchmark: running Original, pg for at least 10 CPU seconds... Original: 11 wallclock secs (10.86 usr + 0.00 sys = 10.86 CPU) @ 43 +770.26/s (n=475345) pg: 11 wallclock secs (10.68 usr + 0.00 sys = 10.68 CPU) @ 43 +28.46/s (n=46228) Rate pg Original pg 4328/s -- -90% Original 43770/s 911% --

      NOTE: I removed the join from your code just to show the looping differences.

        You are right, and thanks for pointing out. My original analysis took the assumption that both s/// and split iterate through the sentence with the same performance, however that was wrong, and split() is much slower:

        use strict; use warnings; use Benchmark qw (:all); my $txt = "a" x 100; sub seperate { split //, $txt; } sub replace { $txt =~ s/a/b/g; } my $result = timethese(100000, {'seperate' => \&seperate, 'replace' => + \&replace});

        This gives:

        Benchmark: timing 10000 iterations of replace, seperate... replace: 0 wallclock secs ( 0.00 usr + 0.00 sys = 0.00 CPU) (warning: too few iterations for a reliable count) seperate: 2 wallclock secs ( 1.20 usr + 0.00 sys = 1.20 CPU) @ 83 +05.65/s (n =10000) C:\Perl\bin>perl -w math1.pl Benchmark: timing 100000 iterations of replace, seperate... replace: 1 wallclock secs ( 0.03 usr + 0.00 sys = 0.03 CPU) @ 31 +25000.00/s (n=100000) (warning: too few iterations for a reliable count) seperate: 16 wallclock secs (12.50 usr + 0.00 sys = 12.50 CPU) @ 80 +00.00/s (n =100000)
Re: Substitute 'bad words' with 'good words' according to lists
by derby (Abbot) on Sep 25, 2005 at 22:26 UTC
Re: Substitute 'bad words' with 'good words' according to lists
by monarch (Priest) on Sep 25, 2005 at 23:33 UTC
    This just gave me a brilliant idea; why don't corporate firewalls just ROT13 any incoming emails with "detected bad language" rather than just rejecting the email all together..
Re: Substitute 'bad words' with 'good words' according to lists
by TedPride (Priest) on Sep 26, 2005 at 00:53 UTC
    Because the object of the firewall is spam filtering more than anything else, and if you let the email through, you're not reducing the amount of spam. Personally, I prefer using Yahoo for my email. Yahoo has the advantage of access to a huge sampling of emails and user input, so it can much better decide whether an email (or host) is spam or not than the average corporate firewall.
      If that's so, then why does my GMail account correctly filter 99% of my 30+ spams/day while Yahoo fails to filter more than 80% of the 50+ spams/day that account receives? Hmm ...

      My criteria for good software:
      1. Does it work?
      2. Can someone else come in, make a change, and be reasonably certain no bugs were introduced?

      Unfortunately, in my experience, Yahoo! generates far too many false positives. I wind up having to go through their bulk mail folder to find messages incorrectly identified as spam.

Re: Substitute 'bad words' with 'good words' according to lists
by mulander (Monk) on Sep 26, 2005 at 05:35 UTC
    sk thanks for the effort of Benchmarking these codes.

    We can of course assume that the wordlist will be medium sized, but the txt can get unlimited length ( of course it won't be created of bad words only ).

    But I really wonder how efficient all of these methods (those that appeared in this node ) are when compared to the php code I supplied in my first writeup.

    I think that php has this method built-in in the core of the language, so it could be hard to beat ( if they did not mess it up ;P ). I suppose there is not a realible way to benchmark php code vs perl code ? Or maybe am I wrong and it can be done?

    Another thing, does any body know how str_replace(); exactly does it's magic while replacing those lists? How many iteration does it do? Or maybe it is done in one go?

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: perlquestion [id://494940]
Approved by Tanktalus
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others wandering the Monastery: (3)
As of 2024-04-20 14:16 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found