in reply to Bad Language In Contact Messages

My preference is to set a variables to true or assign point system, then judge afterwards.
my $fail=0; my @ErrMsg; my @forbidden=('porn','sex'); if (map {$Message =~ /$_/i} @forbidden){ $fail=1; # or $fail++; } if ($fail){ # or ($fail > 2) push @ErrMsg,"Bad Language Not Permitted"; }

Replies are listed 'Best First'.
Re^2: Bad Language In Contact Messages
by Your Mother (Archbishop) on Jul 09, 2022 at 14:11 UTC

    This problem space is pretty difficult to get right even in naïve cases.

    ack -i sex /usr/share/dict/words | nl 1 ambosexous # ... 45 misexplain # ... 199 Wessexman

    Quite a few false positives there and it will pass through these and their variations or the even more common emoji replacements for the topic I won’t include.

    s3x & pr0n
    s̸͚͍̺̺͇̝̥͎͕̃e̸̲͆̅̈́̈̈́̆͆̀͆͘͜x̶̧̟̭̘̿̂̈́͝ ̷̨͇͓͖̅̄͑̓̕͝a̵̰̜̣͎͔͛͗̆̋̓n̸͍̳͍̤̊́͌͌͗͂͘d̵͇̱̣̈́̀͒͆̐͠͝ ̷̡͓̥̙͕̦͙̬̠͒̐ͅp̶̡̖̫̱̞̙͔͑̊͛͌̈́͘ô̷̫̰̫̹̘̖̪͙͌͋͗̓̓̂̿r̴͎͇͖̲̦̤͕̉n̸̘̝͇͌̂͒̈́͝
    รєא คภ๔ ק๏гภ
    ˢᵉˣ ᵃⁿᵈ ᵖᵒʳⁿ
    𝚜𝚎𝚡 𝚊𝚗𝚍 𝚙𝚘𝚛𝚗
    uɹod puɐ xǝs
    🅂🄴🅇 🄰🄽🄳 🄿🄾🅁🄽
    🎀 𝓈𝑒𝓍 𝒶𝓃𝒹 𝓅🍑𝓇𝓃 🎀
    𝘀𝗲𝘅 𝗮𝗻𝗱 𝗽𝗼𝗿𝗻

    Update: I ran those through Text::Unidecode to see if it would untangle them because it’s surprisingly good at that, usually. Not here. Best conversion on any of them was s[?]x [?]n[?] [?][?]rn. The rest were nowhere close.

      Okay, this is silly and I probably could have found a better way to spend the last couple hours but I was curious and surprised by Text::Unidecode’s lack of efficacy here so… (has to be <pre/> instead of <code/> tags because of content.)

      Posting it because I do these things so rarely I’ll lose the memory of how to do it if I don’t. :P

      #!/usr/bin/env perl
      use 5.14.0; # Not sure this is earliest/best.
      use utf8;
      use strictures;
      use open ":std", ":encoding(utf8)";
      use Unicode::UCD "charinfo";
      use charnames ":full";
      use YAML; # For introspection if you're curious.
      
      my @sex_and_porn = <DATA>;
      
      for my $sap ( @sex_and_porn )
      {
          chomp $sap;
          my @string;
          for my $chr ( split "", $sap )
          {
              next unless $chr =~ /\A[[:print:]]+\z/;
              my $info = charinfo(ord($chr));
              # print Dump($info);
              my $name = $info->{name};
              next if $name =~ /COMBINING/;
              my $replacement;
              if ( $name =~ /\b(SMALL|CAPITAL)?(?: LETTER)?(?: TURNED| SHARP)? ([A-Z])\b/ )
              {
                  no warnings "uninitialized";
                  my $case = $1 eq "SMALL" ? "SMALL" : "CAPITAL";
                  $replacement = "LATIN $case LETTER $2";
              }
      
              my $final = charnames::string_vianame( $replacement || $name );
      
              push @string, $final =~ /(?:\b[A-Z]\b| )/i ?
                  $final : "\x{0}";
          }
          s/\x{0}+/[?]/g for @string;
          printf " %15s -> %s\n\n", $sap, join "", @string;
      }
      
      __DATA__
      s3x & pr0n
      s̸͚͍̺̺͇̝̥͎͕̃e̸̲͆̅̈́̈̈́̆͆̀͆͘͜x̶̧̟̭̘̿̂̈́͝ ̷̨͇͓͖̅̄͑̓̕͝a̵̰̜̣͎͔͛͗̆̋̓n̸͍̳͍̤̊́͌͌͗͂͘d̵͇̱̣̈́̀͒͆̐͠͝ ̷̡͓̥̙͕̦͙̬̠͒̐ͅp̶̡̖̫̱̞̙͔͑̊͛͌̈́͘ô̷̫̰̫̹̘̖̪͙͌͋͗̓̓̂̿r̴͎͇͖̲̦̤͕̉n̸̘̝͇͌̂͒̈́͝
      รєא คภ๔ ק๏гภ
      ˢᵉˣ ᵃⁿᵈ ᵖᵒʳⁿ
      𝚜𝚎𝚡 𝚊𝚗𝚍 𝚙𝚘𝚛𝚗
      uɹod puɐ xǝs
      🅂🄴🅇 🄰🄽🄳 🄿🄾🅁🄽
      🎀 𝓈𝑒𝓍 𝒶𝓃𝒹 𝓅🍑𝓇𝓃 🎀
      𝘀𝗲𝘅 𝗮𝗻𝗱 𝗽𝗼𝗿𝗻 
      ßêx and pørñ
      ẞӬ𐊜 AƝḊ ⓅỚℝꞐ
      

      Output–

            s3x & pr0n -> s[?]x [?] pr[?]n
      
       s̸͚͍̺̺͇̝̥͎͕̃e̸̲͆̅̈́̈̈́̆͆̀͆͘͜x̶̧̟̭̘̿̂̈́͝ ̷̨͇͓͖̅̄͑̓̕͝a̵̰̜̣͎͔͛͗̆̋̓n̸͍̳͍̤̊́͌͌͗͂͘d̵͇̱̣̈́̀͒͆̐͠͝ ̷̡͓̥̙͕̦͙̬̠͒̐ͅp̶̡̖̫̱̞̙͔͑̊͛͌̈́͘ô̷̫̰̫̹̘̖̪͙͌͋͗̓̓̂̿r̴͎͇͖̲̦̤͕̉n̸̘̝͇͌̂͒̈́͝ -> sex and porn
      
          รєא คภ๔ ק๏гภ -> [?][?][?] [?][?][?] [?][?][?][?]
      
          ˢᵉˣ ᵃⁿᵈ ᵖᵒʳⁿ -> sex and porn
      
          𝚜𝚎𝚡 𝚊𝚗𝚍 𝚙𝚘𝚛𝚗 -> sex and porn
      
          uɹod puɐ xǝs -> urod pua xes
      
          🅂🄴🅇 🄰🄽🄳 🄿🄾🅁🄽 -> SEX AND PORN
      
       🎀 𝓈𝑒𝓍 𝒶𝓃𝒹 𝓅🍑𝓇𝓃 🎀 -> [?] sex and p[?]rn [?]
      
         𝘀𝗲𝘅 𝗮𝗻𝗱 𝗽𝗼𝗿𝗻  -> sex and porn 
      
          ßêx and pørñ -> sex and porn
      
          ẞӬ𐊜 AƝḊ ⓅỚℝꞐ -> SEX AND PORN
      
      

      And it gets worse if you have to support multiple languages. If the company is in North America, they can probably get away with supporting 3 different languages (spanish, french and english).

      If the company in the EU, there are 24 official languages at last count: Bulgarian, Croatian, Czech, Danish, Dutch, English, Estonian, Finnish, French, German, Greek, Hungarian, Irish, Italian, Latvian, Lithuanian, Maltese, Polish, Portuguese, Romanian, Slovak, Slovenian, Spanish and Swedish.

      You might get away with supporting fewer languages in some private sectors, but if you are running a government system (or a system that wants to receive funding from the government), you might have to support all of those. And as anyone can imagine, a "bad" word in one of those languages might be a standard word with a different meaning in another language.

      Plus, even for countries speaking the same language, there might be big cultural differences. For U.S. citizens, a "tea party" means throwing valuable cargo into the harbor because they endorse "taxation without representation" but want to keep all the taxes for their own government(*). For Brits, it means "5 p.m."

      (*) Puerto Rico

      PerlMonks XP is useless? Not anymore: XPD - Do more with your PerlMonks XP