Removing multibyte UTF-8 chars from strings

cormanaz has asked for the wisdom of the Perl Monks concerning the following question:

Hi all. I am processing some twitter user names that have unwanted multibyte UTF-8 characters in them. I got the ORD values of these using the following script:

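use strict;
use warnings;
use feature ':5.10';
my $foo = '&#8294;JenAFifield&#8297;';
# Stop at length - 1; 0..length($foo) would read one position past the end
foreach my $i (0 .. length($foo) - 1) {
    my $char    = substr($foo, $i, 1);
    my $charnum = ord($char);
    say "$char\t$charnum";
}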

but I see this forum is replacing the characters with some other encoding I don't recognize. In my editor (Komodo) they are at the beginning and end of the $foo string. The beginning one is a dotted box with LRI inside, and the end one has the same box with PDI inside. The script returns:

where in my editor the undisplayable char is a black rectangle with HOP inside. I'm not sure why it displays like that because ASCII 129 should be u-umlaut.

Am wondering how to do a regexp that will get rid of these chars. Based on numbers shown here I tried $foo =~ s/\x8294|\x8297//g; but that didn't do it. Can anyone help?

I am Satoshi Nakamoto


Re: Removing multibyte UTF-8 chars from strings by Corion (Patriarch) on Jan 10, 2022 at 18:18 UTC
You don't show us where the string is initialized. If you have the string verbatim in your editor, you might want to save the file with the UTF-8 encoding and then `use utf8;` at the top. Personally, I prefer to `use charnames ':full';` and then write the characters using `\N{...}` named escapes.

As for the replacement target, you also need to show us where you get it from, and you need to tell Perl what encoding the string is in. Most likely the string already is UTF-8 but Perl doesn't know it. In that case, tell Perl by decoding it:

use Encode 'decode';
...
my $string = decode('UTF-8', $input_string);

# Keep only what we want:
$string =~ m!([a-zA-Z0-9]+)!
    or warn "Invalid/empty username in '$string'";
my $real_user = $1;

# Remove stuff we don't want, especially the writing direction isolates:
$string =~ s!\x{2066}|\x{2069}!!g;
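The decoding step described in this reply can be sketched end to end as follows. This is a minimal illustration and not the poster's actual script: the helper name `strip_isolates` and the sample byte string are assumptions made for the example. The bytes `E2 81 A6` and `E2 81 A9` are the UTF-8 encodings of U+2066 (LRI) and U+2069 (PDI). The middle byte, 0x81, is decimal 129, which may be why an undecoded string reports 129 from `ord`.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper (not from the thread): decode raw UTF-8 bytes into
# Perl characters, then delete the two isolate characters the thread names.
sub strip_isolates {
    my ($bytes) = @_;
    my $string = decode('UTF-8', $bytes);
    $string =~ s/[\x{2066}\x{2069}]//g;    # U+2066 = LRI, U+2069 = PDI
    return $string;
}

# Assumed sample input: raw bytes as they might arrive from a file that
# was read without an encoding layer.
my $raw   = "\xE2\x81\xA6JenAFifield\xE2\x81\xA9";
my $clean = strip_isolates($raw);

printf "%d characters: %s\n", length($clean), $clean;
```

Without the `decode` call, each isolate is three separate bytes, so a pattern written with `\x{2066}` has nothing to match.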
Re^2: Removing multibyte UTF-8 chars from strings by cormanaz (Deacon) on Jan 10, 2022 at 19:27 UTC
Ya sorry, I was reading from a file and clipped the offending chars from that. The closing regex did the trick. Never heard of a "direction isolate."
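For reference, the original attempt `s/\x8294|\x8297//g` could not have matched. The numbers 8294 and 8297 are decimal code points, while `\x` expects hexadecimal, and without braces it reads only two hex digits, so `\x8294` means the character 0x82 followed by the literal text "94". The sketch below converts the decimal values and, as a broader alternative not used in the thread, removes every character with the Unicode `Bidi_Control` property. The helper names `as_hex_escape` and `strip_bidi_controls` are hypothetical. Stripping the whole property class is an assumption about what is wanted, since it also removes directional marks, embeddings, and overrides.

```perl
use strict;
use warnings;

# Hypothetical helper: format a decimal code point (as reported by ord)
# as the braced hex escape that a Perl regex understands.
sub as_hex_escape {
    my ($codepoint) = @_;
    return sprintf('\\x{%04X}', $codepoint);
}

# Hypothetical helper: remove all bidirectional control characters from an
# already-decoded string. This covers the isolates U+2066..U+2069 and more.
sub strip_bidi_controls {
    my ($string) = @_;
    $string =~ s/\p{Bidi_Control}//g;
    return $string;
}

print as_hex_escape(8294), "\n";
print as_hex_escape(8297), "\n";

# Assumed sample: a decoded string wrapped in LRI and PDI.
my $name = "\x{2066}JenAFifield\x{2069}";
print length(strip_bidi_controls($name)), "\n";
```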