in reply to Substitute 'bad words' with 'good words' according to lists

You're on the right track, but you have a few issues to deal with --

  1. Order matters -- some terms to be replaced are found within other terms to be replaced, so you may get different effects depending on the order in which the keys of %words are returned.
  2. Intra-word matching -- this may be necessary (people making inappropriate compound words), but if you're trying to remove 'ass', you want to match 'kickass' and not 'assume'. It may be necessary to match on word boundaries if you only want to catch terms on their own (and even that won't be perfect).
  3. Chaining effects -- because you're making a separate pass for each term, if there's any possibility of a replacement producing a match for another term, you'll again be dependent on the order in which items are returned from %words.
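
To make the order and boundary problems concrete, here is a minimal sketch. The word list and the replace_in_order helper are made up for illustration; they are not from the original post.

```perl
use strict;
use warnings;

# Hypothetical word list, for illustration only.
my %words = (
    darn => 'heck',
    heck => 'gosh',
);

# One substitution pass per term, in the order given by the caller.
# This stands in for looping over "keys %words", whose order is not guaranteed.
sub replace_in_order {
    my ($txt, @order) = @_;
    for my $bad (@order) {
        $txt =~ s/\Q$bad\E/$words{$bad}/g;
    }
    return $txt;
}

# Chaining: 'darn' becomes 'heck', which the later 'heck' pass turns into 'gosh'.
my $chained   = replace_in_order('darn it', 'darn', 'heck');
# Reversed order: the 'heck' pass runs first and finds nothing to replace.
my $unchained = replace_in_order('darn it', 'heck', 'darn');

# Intra-word matching: without \b the term is also found inside a longer word.
(my $loose  = 'darning socks') =~ s/darn/heck/g;
(my $strict = 'darning socks') =~ s/\bdarn\b/heck/g;

print "$chained\n$unchained\n$loose\n$strict\n";
```

The same input gives 'gosh it' or 'heck it' depending only on the order of the passes, and the unanchored pattern mangles 'darning' while the anchored one leaves it alone.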

What's the best solution? I have no idea. I might be able to come up with something more efficient, but what's the acceptable tradeoff between ease of adding new terms and other maintenance time, execution time, missed terms, incorrectly replaced terms, or whatever other parameters you might have?

I'd have done something like the following, if we were matching on whole words, and we didn't have the other issues I mentioned:

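    # quotemeta keeps any regex metacharacters in the terms literal
    my $regex_string = '\b(' . join('|', map { quotemeta($_) } keys %words) . ')\b';
    my $regex = qr/$regex_string/;
    $txt =~ s/$regex/$words{$1}/g;   # $1 is the term that matched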
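
For anyone who wants to run the single-pass idea end to end, the following is a rough sketch under some assumptions: the %words contents and the build_regex and clean helper names are hypothetical, and sorting the keys longest-first is my own addition. The sort mostly matters if you drop the \b anchors, but it also makes the generated regex the same on every run.

```perl
use strict;
use warnings;

# Hypothetical word list, for illustration only.
my %words = (
    darn   => 'heck',
    darnit => 'oh well',
    heck   => 'gosh',
);

# Build one alternation from the hash keys (helper name is an assumption).
sub build_regex {
    my ($map) = @_;
    my $alternation = join '|',
        map  { quotemeta($_) }                              # keep metacharacters literal
        sort { length($b) <=> length($a) || $a cmp $b }     # longest first, then alphabetical
        keys %$map;
    return qr/\b($alternation)\b/;
}

# Replace every whole-word match in a single pass over the text.
sub clean {
    my ($txt, $map) = @_;
    my $regex = build_regex($map);
    $txt =~ s/$regex/$map->{$1}/g;   # $1 holds the term that matched
    return $txt;
}

my $out = clean('darn, darnit, what the heck', \%words);
print "$out\n";
```

Because the text is scanned only once, the 'heck' produced by replacing 'darn' is not replaced again, so the chaining problem from the per-term loop does not arise here. Matching is case-sensitive as written; adding the /i modifier would also require lowercasing $1 before the hash lookup.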

I think there's a module on CPAN that builds better regexes from lists, but I can't remember what it was called. (Depending on how often you rebuild the list, the number of terms, available memory, etc., this might not be the best way for you.)

Re^2: Substitute 'bad words' with 'good words' according to lists
by mulander (Monk) on Sep 25, 2005 at 18:52 UTC
    Thank you both for the time it took you to write these responses.

    pg, I find your method interesting and agree that it is much more efficient than the one I came up with. Thanks for pointing out my big mistake (the number of iterations).

    jhourcle, I agree with the additional problems you mentioned; they must be taken care of. I created this node not to finish a particular script but to find a method similar to the PHP one. The code you added to your post is 'exactly' the answer I was seeking, as it shows that it can be done almost exactly as in PHP. I tried a similar solution by joining a list of bad words and a list of good words, but that was obviously wrong, as I did not know how to replace the word matched by the regex with the corresponding good word. You showed me that it can be done with a hash, and now I see how blind I was before. Thank you again, and thank you both for the time it took you to read this node and share your ideas.