http://qs1969.pair.com?node_id=1104325

Rodster001 has asked for the wisdom of the Perl Monks concerning the following question:

It seems I have read every page on character encoding I can find, but I am still missing something. I hope the remaining confusion can be cleared up here.
#!/usr/bin/perl
use strict;
use Encode;
use Text::Unaccent::PurePerl qw(unac_string);

use utf8;
my $string = "Queensrÿche";
no utf8;

chars($string);
(Encode::is_utf8($string))? print " - this is utf8\n" : print " - this is NOT utf8\n";
print "unaccented: " . Text::Unaccent::PurePerl::unac_string($string) . "\n";
print $string;
exit;

sub chars {
    my $k = shift;
    my @chars = split("",$k);
    foreach (@chars) {
        my $dec = ord($_);
        my $chr = chr(ord($_));
        my $escape = qquote($_);
        print "\t$dec\t$chr\t$escape\n";
    }
}

sub qquote {
    local($_) = shift;
    s/([\\\"\@\$])/\\$1/g;
    my $bytes;
    { use bytes; $bytes = length }
    s/([[:^ascii:]])/'\x{'.sprintf("%x",ord($1)).'}'/ge if $bytes > length;
    return $_;
}
This is what I see in my terminal (I am using SecureCRT, with Terminal > Appearance > Character encoding set to UTF-8):
81 Q Q
117 u u
101 e e
101 e e
110 n n
115 s s
114 r r
255 {ff}
99 c c
104 h h
101 e e
 - this is utf8
unaccented: Queensryche
Queensr
Here are my questions about this:
  1. Why is the "ÿ" not printing correctly here in my terminal?
  2. ord() returns 255 for ÿ, a single byte. Encode thinks this is utf-8, but isn't this actually utf-16?
    Utf-16 table: http://asecuritysite.com/coding/asc2
I have another version of "Queensrÿche" (in a JSON file). When I parse that and run it through the same thing, this is what I get:
81 Q Q
117 u u
101 e e
101 e e
110 n n
115 s s
114 r r
195 {c3}
191 {bf}
99 c c
104 h h
101 e e
 - this is utf8
unaccented: QueensrA
Queensrÿche
This is where the deep confusion is for me.
  1. This actually looks like valid UTF-8 to me and Encode agrees. Is that correct?
    ord() returns two bytes for ÿ, 195 and 191, which match up with this table:
    Utf-8 table: http://www.utf8-chartable.de/unicode-utf8-table.pl?utf8=dec
  2. Text::Unaccent::PurePerl does not "unaccent" it properly. Why not?
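For reference, the relationship between the single value 255 and the byte pair 195, 191 can be checked with the core Encode module. This is only a minimal sketch and not part of the script above; the variable names are illustrative. It shows that 255 is the code point of ÿ (U+00FF), while 195 and 191 are the two bytes that code point becomes when encoded as UTF-8.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# U+00FF (y with diaeresis) as a one-character string.
# chr(0xFF) avoids needing "use utf8" for a literal in the source.
my $char = chr(0xFF);

# Encoding that character as UTF-8 yields the two bytes C3 BF.
my $bytes       = encode('UTF-8', $char);
my @byte_values = map { ord } split //, $bytes;

print "code point:  ", ord($char), "\n";    # 255
print "UTF-8 bytes: @byte_values\n";        # 195 191

# Decoding the bytes gives back the single character.
my $round_trip = decode('UTF-8', $bytes);
print "round trip ok\n" if $round_trip eq $char;
```

So the first run appears to show a decoded character and the second run the encoded bytes of that same character.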
Finally, since these two strings cannot be compared and matched as being the same (I understand why), I need to "normalize" them.
  1. Is #1 Queensrÿche or #2 Queensrÿche actually utf-8? (Or are they both actually utf-8 as Encode believes?)
  2. Is there a way to safely convert them to the same encoding? I would like to preserve the ÿ but I would also like to be able to use Text::Unaccent::PurePerl.
Thanks!

Update #1:
--------------------------------------

Taking out the "use utf8" and "no utf8":

#use utf8;
my $string = "Queensrÿche";
#no utf8;
And then running it again:
81 Q Q
117 u u
101 e e
101 e e
110 n n
115 s s
114 r r
195
191
99 c c
104 h h
101 e e
 - this is NOT utf8
unaccented: QueensrA
Queensrÿche
This confuses me even more. I understand the utf8 flag is not set now, so Encode doesn't see it as utf8. But I see that the two utf-8 bytes for the "ÿ" are there (195 191), instead of the 255 I get with "use utf8". It prints correctly (and displays in my terminal properly) but does not unaccent correctly. Much confusion.
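A minimal sketch of one way the two versions might be brought to the same representation, using only the core Encode module. It assumes that the second version reaches the program as undecoded UTF-8 bytes (as the 195 191 output suggests); the variable names are hypothetical. Whether Text::Unaccent::PurePerl then behaves as wanted on the decoded string is not verified here.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Version 1: a character string, as "use utf8" would give for the literal.
my $from_source = "Queensr\x{FF}che";

# Version 2 (assumed): the same name as raw UTF-8 bytes, e.g. read undecoded.
my $from_json = "Queensr\xC3\xBFche";

print "before decode: ",
    ($from_source eq $from_json ? "equal" : "different"), "\n";    # different

# Decode the byte string into characters so both use one representation.
my $decoded = decode('UTF-8', $from_json);

print "after decode:  ",
    ($from_source eq $decoded ? "equal" : "different"), "\n";      # equal
print "lengths: ", length($from_source), " and ", length($decoded), "\n";

# Encode explicitly when writing to a UTF-8 terminal.
print encode('UTF-8', $decoded), "\n";
```

The idea in this sketch is to decode on input, work with character strings internally, and encode only on output.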