cormanaz has asked for the wisdom of the Perl Monks concerning the following question:

Hi all. I am processing some twitter user names that have unwanted multibyte UTF-8 characters in them. I got the ORD values of these using the following script:
use feature ':5.10';
my $foo = '⁦JenAFifield⁩';
foreach my $i (0..length($foo)) {
    $char = substr($foo,$i,1);
    $charnum = ord($char);
    say "$char\t$charnum";
}
but I see this forum is replacing the characters with some other encoding I don't recognize. In my editor (Komodo) they are at the beginning and end of the $foo string. The beginning one is a dotted box with LRI inside, and the end one has the same box with PDI inside. The script returns:
226
129
166
J 74
e 101
n 110
A 65
F 70
i 105
f 102
i 105
e 101
l 108
d 100
226
129
169
where in my editor the undisplayable char is a black rectangle with HOP inside. I'm not sure why it displays like that because ASCII 129 should be u-umlaut.

Am wondering how to do a regexp that will get rid of these chars. Based on numbers shown here I tried $foo =~ s/\x8294|\x8297//g; but that didn't do it. Can anyone help?

I am Satoshi Nakamoto

Re: Removing multibyte UTF-8 chars from strings
by Corion (Patriarch) on Jan 10, 2022 at 18:18 UTC

    You don't show us where the string is initialized.

    If you have the string verbatim in your editor, you might want to save the file with the UTF-8 encoding and then use utf8; at the top. Personally, I prefer to use charnames ':full'; and then write the characters using \N{...} named escapes.
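A minimal sketch of the named-escape approach mentioned above, so the source file can stay plain ASCII. The sample string is a hypothetical stand-in for the real input, and the character names assume a Perl whose bundled Unicode data includes the isolate characters (they were added in Unicode 6.3, so roughly Perl 5.20 or later).

```perl
use strict;
use warnings;
use charnames ':full';

# Hypothetical sample: the user name wrapped in the two isolate characters,
# spelled with named escapes instead of pasted invisible characters.
my $name = "\N{LEFT-TO-RIGHT ISOLATE}JenAFifield\N{POP DIRECTIONAL ISOLATE}";

# Strip both isolates; the same named escapes work inside a character class.
(my $clean = $name) =~ s/[\N{LEFT-TO-RIGHT ISOLATE}\N{POP DIRECTIONAL ISOLATE}]//g;

printf "%d characters before, %d after\n", length($name), length($clean);
print "$clean\n";
```

The names also document intent: a reader sees which characters are being removed without looking up code points.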

    As for the replacement target, you also need to tell/show us where you get it from, and you need to tell Perl what encoding the string is in. Maybe/most likely, the string already is UTF-8 but Perl doesn't know it. Then you should tell it to Perl by using:

    use Encode 'decode';
    ...
    my $string = decode('UTF-8', $input_string);

    # Keep only what we want:
    $string =~ m!([a-zA-Z0-9]+)!
        or warn "Invalid/empty username in '$string'";
    my $real_user = $1;

    # Remove stuff we don't want, especially the writing direction isolates:
    $string =~ s!\x{2066}|\x{2069}!!g;
      Ya sorry, I was reading from a file and clipped the offending chars from that. The closing regex did the trick. Never heard of a "direction isolate."
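For readers wondering why the original attempt failed: the bytes 226 129 166 and 226 129 169 from the question are the UTF-8 encodings of U+2066 and U+2069, and 8294 and 8297 are those same code points written in decimal. In a Perl regex, \x expects hexadecimal, and without braces it reads only two hex digits, so \x8294 means the character 0x82 followed by the literal text "94". The sketch below walks through this, assuming the input arrives as undecoded bytes (as it would from a file read without an encoding layer); the byte string is reconstructed from the output in the question.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Bytes from the question: 226 129 166 = E2 81 A6, 226 129 169 = E2 81 A9.
my $bytes = "\xE2\x81\xA6JenAFifield\xE2\x81\xA9";

# Decode once so Perl treats each isolate as a single character.
my $string = decode('UTF-8', $bytes);

# First and last characters after decoding.
printf "U+%04X\n", ord($_) for substr($string, 0, 1), substr($string, -1);

# The decimal numbers from the question are the same code points in hex.
printf "%d = 0x%X, %d = 0x%X\n", 8294, 8294, 8297, 8297;

# With braces and hex digits, the substitution matches.
(my $clean = $string) =~ s/\x{2066}|\x{2069}//g;
print "$clean\n";
```

If other directional controls might appear in the data, a broader pattern such as s/\p{Bidi_Control}//g may be worth considering; whether it covers the isolates depends on the Unicode version bundled with the Perl in use.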