You're on the right track, but you have a few issues to deal with --

  1. Order matters -- some terms to be replaced are found within other terms to be replaced, so you may get different effects depending on what order the keys of %words are returned in.
  2. Intra-word matching -- this may be necessary (people make inappropriate compound words), but if you're trying to remove 'ass', you want to match 'kickass' and not 'assume'. If you only want to catch terms on their own, you may need to match on word boundaries (and even that won't be perfect).
  3. Chaining effects -- because you make a separate pass for each term, a replacement may produce text that matches another term. If that can happen in your situation, you'll again be dependent on the order items are returned from %words.
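
To make the three issues concrete, here is a minimal sketch. The word list and the replace_in_order helper are made up for illustration; the helper just runs one substitution pass per term in whatever order it's handed, which is roughly what looping over keys %words does.

```perl
use strict;
use warnings;

# Hypothetical word list, for illustration only.
my %words = (ass => 'donkey', kickass => 'great', donkey => 'mule');

# Naive approach: one substitution pass per term, in the order given.
sub replace_in_order {
    my ($txt, @order) = @_;
    for my $bad (@order) {
        $txt =~ s/\Q$bad\E/$words{$bad}/g;
    }
    return $txt;
}

# 1. Order matters: 'ass' is found inside 'kickass'.
my $short_first = replace_in_order('kickass', 'ass', 'kickass');   # kickdonkey
my $long_first  = replace_in_order('kickass', 'kickass', 'ass');   # great

# 2. Intra-word matching: without \b, an innocent word gets mangled.
my $mangled = replace_in_order('I assume so', 'ass');              # I donkeyume so

# 3. Chaining: 'ass' becomes 'donkey', which a later pass turns into 'mule'.
my $chained   = replace_in_order('an ass', 'ass', 'donkey');       # an mule
my $unchained = replace_in_order('an ass', 'donkey', 'ass');       # an donkey

print "$_\n" for $short_first, $long_first, $mangled, $chained, $unchained;
```

Since Perl makes no promise about the order keys returns, any of these results could show up from one run to the next.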

What's the best solution? I have no idea. I might be able to come up with something more efficient, but what's the acceptable tradeoff between the ease of adding new terms, other maintenance time, execution time, missed terms, incorrectly replaced terms, and whatever other parameters you might have?

I'd have done something like the following, if we were matching on whole words, and we didn't have the other issues I mentioned:

my $regex_string = '\b(' . join('|', map { quotemeta($_) } keys %words) . ')\b';
my $regex = qr/$regex_string/;
$txt =~ s/$regex/$words{$1}/g;
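
For anyone who wants to try the single-pass idea, here is a self-contained sketch. The contents of %words and the sample text are invented for the example, and sorting the terms longest-first is my own addition, so that a longer term wins wherever alternatives overlap. Treat it as one way to do it under those assumptions, not a finished filter.

```perl
use strict;
use warnings;

# Hypothetical word list, for illustration only.
my %words = (ass => 'donkey', kickass => 'great', donkey => 'mule');

# Build one alternation from all the terms. Longest terms go first,
# and quotemeta escapes any regex metacharacters in a term.
my $alternation = join '|',
    map  { quotemeta($_) }
    sort { length($b) <=> length($a) || $a cmp $b }
    keys %words;
my $regex = qr/\b($alternation)\b/;

my $txt = 'I assume that kickass ass is no donkey';

# Single pass: each match is looked up by the captured term ($1).
# Replacement text is not rescanned, so 'donkey' produced from 'ass'
# is not turned into 'mule' afterwards.
(my $clean = $txt) =~ s/$regex/$words{$1}/g;

print "$clean\n";
```

Because everything happens in one s///g, the order of keys %words no longer affects the result, and the \b anchors leave 'assume' alone. They also mean a bare 'ass' pattern would never catch 'kickass'; it is only caught here because it is listed as its own term.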

I think there's a module on CPAN that builds better regexes from lists, but I can't remember what it's called. (Depending on how often you rebuild the list, the number of terms, available memory, etc., this might not be the best way for you.)


In reply to Re: Substitute 'bad words' with 'good words' according to lists by jhourcle
in thread Substitute 'bad words' with 'good words' according to lists by mulander
