in reply to Re: Bad Language In Contact Messages
in thread Bad Language In Contact Messages

This problem space is pretty difficult to get right even in naïve cases.

ack -i sex /usr/share/dict/words | nl 1 ambosexous # ... 45 misexplain # ... 199 Wessexman

Quite a few false positives there and it will pass through these and their variations or the even more common emoji replacements for the topic I won’t include.

s3x & pr0n
s̸͚͍̺̺͇̝̥͎͕̃e̸̲͆̅̈́̈̈́̆͆̀͆͘͜x̶̧̟̭̘̿̂̈́͝ ̷̨͇͓͖̅̄͑̓̕͝a̵̰̜̣͎͔͛͗̆̋̓n̸͍̳͍̤̊́͌͌͗͂͘d̵͇̱̣̈́̀͒͆̐͠͝ ̷̡͓̥̙͕̦͙̬̠͒̐ͅp̶̡̖̫̱̞̙͔͑̊͛͌̈́͘ô̷̫̰̫̹̘̖̪͙͌͋͗̓̓̂̿r̴͎͇͖̲̦̤͕̉n̸̘̝͇͌̂͒̈́͝
รєא คภ๔ ק๏гภ
ˢᵉˣ ᵃⁿᵈ ᵖᵒʳⁿ
𝚜𝚎𝚡 𝚊𝚗𝚍 𝚙𝚘𝚛𝚗
uɹod puɐ xǝs
🅂🄴🅇 🄰🄽🄳 🄿🄾🅁🄽
🎀 𝓈𝑒𝓍 𝒶𝓃𝒹 𝓅🍑𝓇𝓃 🎀
𝘀𝗲𝘅 𝗮𝗻𝗱 𝗽𝗼𝗿𝗻

Update: I ran those through Text::Unidecode to see if it would untangle them because it’s surprisingly good at that, usually. Not here. Best conversion on any of them was s[?]x [?]n[?] [?][?]rn. The rest were nowhere close.

Replies are listed 'Best First'.
Re^3: Bad Language In Contact Messages
by Your Mother (Archbishop) on Jul 09, 2022 at 20:52 UTC

    Okay, this is silly and I probably could have found a better way to spend the last couple hours but I was curious and surprised by Text::Unidecode’s lack of efficacy here so… (has to be <pre/> instead of <code/> tags because of content.)

    Posting it because I do these things so rarely I’ll lose the memory of how to do it if I don’t. :P

    #!/usr/bin/env perl
    use 5.14.0; # Not sure this is earliest/best.
    use utf8;
    use strictures;
    use open ":std", ":encoding(utf8)";
    use Unicode::UCD "charinfo";
    use charnames ":full";
    use YAML; # For introspection if you're curious.
    
    my @sex_and_porn = <DATA>;
    
    for my $sap ( @sex_and_porn )
    {
        chomp $sap;
        my @string;
        for my $chr ( split "", $sap )
        {
            next unless $chr =~ /\A[[:print:]]+\z/;
            my $info = charinfo(ord($chr));
            # print Dump($info);
            my $name = $info->{name};
            next if $name =~ /COMBINING/;
            my $replacement;
            if ( $name =~ /\b(SMALL|CAPITAL)?(?: LETTER)?(?: TURNED| SHARP)? ([A-Z])\b/ )
            {
                no warnings "uninitialized";
                my $case = $1 eq "SMALL" ? "SMALL" : "CAPITAL";
                $replacement = "LATIN $case LETTER $2";
            }
    
            my $final = charnames::string_vianame( $replacement || $name );
    
            push @string, $final =~ /(?:\b[A-Z]\b| )/i ?
                $final : "\x{0}";
        }
        s/\x{0}+/[?]/g for @string;
        printf " %15s -> %s\n\n", $sap, join "", @string;
    }
    
    __DATA__
    s3x & pr0n
    s̸͚͍̺̺͇̝̥͎͕̃e̸̲͆̅̈́̈̈́̆͆̀͆͘͜x̶̧̟̭̘̿̂̈́͝ ̷̨͇͓͖̅̄͑̓̕͝a̵̰̜̣͎͔͛͗̆̋̓n̸͍̳͍̤̊́͌͌͗͂͘d̵͇̱̣̈́̀͒͆̐͠͝ ̷̡͓̥̙͕̦͙̬̠͒̐ͅp̶̡̖̫̱̞̙͔͑̊͛͌̈́͘ô̷̫̰̫̹̘̖̪͙͌͋͗̓̓̂̿r̴͎͇͖̲̦̤͕̉n̸̘̝͇͌̂͒̈́͝
    รєא คภ๔ ק๏гภ
    ˢᵉˣ ᵃⁿᵈ ᵖᵒʳⁿ
    𝚜𝚎𝚡 𝚊𝚗𝚍 𝚙𝚘𝚛𝚗
    uɹod puɐ xǝs
    🅂🄴🅇 🄰🄽🄳 🄿🄾🅁🄽
    🎀 𝓈𝑒𝓍 𝒶𝓃𝒹 𝓅🍑𝓇𝓃 🎀
    𝘀𝗲𝘅 𝗮𝗻𝗱 𝗽𝗼𝗿𝗻 
    ßêx and pørñ
    ẞӬ𐊜 AƝḊ ⓅỚℝꞐ
    

    Output–

          s3x & pr0n -> s[?]x [?] pr[?]n
    
     s̸͚͍̺̺͇̝̥͎͕̃e̸̲͆̅̈́̈̈́̆͆̀͆͘͜x̶̧̟̭̘̿̂̈́͝ ̷̨͇͓͖̅̄͑̓̕͝a̵̰̜̣͎͔͛͗̆̋̓n̸͍̳͍̤̊́͌͌͗͂͘d̵͇̱̣̈́̀͒͆̐͠͝ ̷̡͓̥̙͕̦͙̬̠͒̐ͅp̶̡̖̫̱̞̙͔͑̊͛͌̈́͘ô̷̫̰̫̹̘̖̪͙͌͋͗̓̓̂̿r̴͎͇͖̲̦̤͕̉n̸̘̝͇͌̂͒̈́͝ -> sex and porn
    
        รєא คภ๔ ק๏гภ -> [?][?][?] [?][?][?] [?][?][?][?]
    
        ˢᵉˣ ᵃⁿᵈ ᵖᵒʳⁿ -> sex and porn
    
        𝚜𝚎𝚡 𝚊𝚗𝚍 𝚙𝚘𝚛𝚗 -> sex and porn
    
        uɹod puɐ xǝs -> urod pua xes
    
        🅂🄴🅇 🄰🄽🄳 🄿🄾🅁🄽 -> SEX AND PORN
    
     🎀 𝓈𝑒𝓍 𝒶𝓃𝒹 𝓅🍑𝓇𝓃 🎀 -> [?] sex and p[?]rn [?]
    
       𝘀𝗲𝘅 𝗮𝗻𝗱 𝗽𝗼𝗿𝗻  -> sex and porn 
    
        ßêx and pørñ -> sex and porn
    
        ẞӬ𐊜 AƝḊ ⓅỚℝꞐ -> SEX AND PORN
    
    
Re^3: Bad Language In Contact Messages
by cavac (Prior) on Jul 12, 2022 at 10:03 UTC

    And it gets worse if you have to support multiple languages. If the company is in North America, they can probably get away with supporting 3 different languages (spanish, french and english).

    If the company in the EU, there are 24 official languages at last count: Bulgarian, Croatian, Czech, Danish, Dutch, English, Estonian, Finnish, French, German, Greek, Hungarian, Irish, Italian, Latvian, Lithuanian, Maltese, Polish, Portuguese, Romanian, Slovak, Slovenian, Spanish and Swedish.

    You might get away with supporting fewer languages in some private sectors, but if you are running a government system (or a system that wants to receive funding from the government), you might have to support all of those. And as anyone can imagine, a "bad" word in one of those languages might be a standard word with a different meaning in another language.

    Plus, even for countries speaking the same language, there might be big cultural differences. For U.S. citizens, a "tea party" means throwing valuable cargo into the harbor because they endorse "taxation without representation" but want to keep all the taxes for their own government(*). For Brits, it means "5 p.m."

    (*) Puerto Rico

    PerlMonks XP is useless? Not anymore: XPD - Do more with your PerlMonks XP

        that's why slang and jargon were invented. People have been leaving bad language comments since the time a storm was brewing up in Eden. And I thank them, mostly.

Re^3: Bad Language In Contact Messages
by LanX (Saint) on Jul 09, 2022 at 16:31 UTC