Re: Remove u200b unicode From String

If you can print the character somewhere you can copy it from, e.g. an xterm, you can just paste it into your regular expression and it should work. For example, using code point 478 (U+01DE), which is an A with a diaeresis and a macron above:

perl -we '$chr = "Ǟ"; $s = "abc" . $chr . "xyz"; print "$s\n"; $s =~ s/$chr/ /g; print "$s\n"'

outputs

abcǞxyz
abc xyz
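
For the character in the thread title you cannot really see what you are pasting, so it may be easier to name it by code point instead. A minimal sketch, assuming the string already holds decoded characters (if it holds raw UTF-8 bytes, decode it first); strip_zwsp is a made-up helper name, not something from the original question:

```perl
use strict;
use warnings;

# Hypothetical helper (not from the original post): remove every
# ZERO WIDTH SPACE (U+200B) from a string of decoded characters.
sub strip_zwsp {
    my ($text) = @_;
    $text =~ s/\x{200B}//g;    # \x{...} names the character by code point
    return $text;
}

my $s     = "abc\x{200B}xyz";  # 7 characters, one of them invisible
my $clean = strip_zwsp($s);

# Print lengths only, so no wide characters are written to STDOUT
printf "%d -> %d characters\n", length($s), length($clean);
```

This should print "7 -> 6 characters".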

Alternatively, you can do something like the following to find characters outside the ASCII range:

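use strict;
use warnings;
use Encode;

my $s = get_s_from_somewhere();      # raw UTF-8 bytes
my $chars = decode("UTF-8", $s);     # bytes -> Perl characters
my %non_ascii;

# Count every character whose code point lies outside ASCII (0..127)
for my $i (0..length($chars)-1) {
  my $c = substr($chars, $i, 1);
  $non_ascii{$c}++ if ord($c) > 127;
}

do_something_with_non_ascii(\%non_ascii);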
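
The same count can probably be done with a global regex match instead of an index loop. This is only a sketch under the same assumption that the input is a UTF-8 encoded byte string; count_non_ascii is a hypothetical name and the sample bytes are made up for illustration:

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: count each non-ASCII character in a UTF-8 byte string.
# Returns a hash reference mapping character => number of occurrences.
sub count_non_ascii {
    my ($bytes) = @_;
    my $chars = decode("UTF-8", $bytes);
    my %count;
    # Each successful match captures one character outside 0x00..0x7F
    $count{$1}++ while $chars =~ /([^\x00-\x7F])/g;
    return \%count;
}

# Sample input: two U+200B (bytes E2 80 8B) and one U+01DE (bytes C7 9E)
my $found = count_non_ascii("abc\xE2\x80\x8Bx\xC7\x9Ey\xE2\x80\x8B");

# Report by code point so invisible characters show up in the output
printf "U+%04X x %d\n", ord($_), $found->{$_} for sort keys %$found;
```

Printing the code points rather than the characters themselves makes something like U+200B visible in the report, and tells you what to put inside \x{...} in a substitution.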