Rodster001 has asked for the wisdom of the Perl Monks concerning the following question:
It seems I have read every page on character encoding I can find. But I am missing something. I still have a bit of confusion which I hope can get cleared up here.
    #!/usr/bin/perl
    use strict;
    use Encode;
    use Text::Unaccent::PurePerl qw(unac_string);

    use utf8;
    my $string = "Queensrÿche";
    no utf8;

    chars($string);
    (Encode::is_utf8($string))? print " - this is utf8\n" : print " - this is NOT utf8\n";
    print "unaccented: " . Text::Unaccent::PurePerl::unac_string($string) . "\n";
    print $string;
    exit;

    sub chars {
        my $k = shift;
        my @chars = split("",$k);
        foreach (@chars) {
            my $dec = ord($_);
            my $chr = chr(ord($_));
            my $escape = qquote($_);
            print "\t$dec\t$chr\t$escape\n";
        }
    }

    sub qquote {
        local($_) = shift;
        s/([\\\"\@\$])/\\$1/g;
        my $bytes; { use bytes; $bytes = length }
        s/([[:^ascii:]])/'\x{'.sprintf("%x",ord($1)).'}'/ge if $bytes > length;
        return $_;
    }

This is what I am seeing in my terminal (I am using SecureCRT, with Terminal > Appearance > Character encoding: UTF-8):
    81      Q       Q
    117     u       u
    101     e       e
    101     e       e
    110     n       n
    115     s       s
    114     r       r
    255             {ff}
    99      c       c
    104     h       h
    101     e       e
     - this is utf8
    unaccented: Queensryche
    Queensr

Here are my questions about this:
- Why is the "ÿ" not printing correctly here in my terminal?
- ord() returns 255 for ÿ, a single byte. Encode thinks this is utf-8, but isn't this actually utf-16?
Utf-16 table: http://asecuritysite.com/coding/asc2
    81      Q       Q
    117     u       u
    101     e       e
    101     e       e
    110     n       n
    115     s       s
    114     r       r
    195             {c3}
    191             {bf}
    99      c       c
    104     h       h
    101     e       e
     - this is utf8
    unaccented: QueensrA
    Queensrÿche

This is where the deep confusion is for me.
- This actually looks like valid UTF-8 to me and Encode agrees. Is that correct?
ord() returns two bytes for ÿ, 195 and 191, which matches up with this table:
Utf-8 table: http://www.utf8-chartable.de/unicode-utf8-table.pl?utf8=dec
- Text::Unaccent::PurePerl does not "unaccent" it properly. Why not?
- Is #1 Queensrÿche or #2 Queensrÿche actually utf-8? (Or are they both actually utf-8 as Encode believes?)
- Is there a way to safely convert them to the same encoding? I would like to preserve the ÿ but I would also like to be able to use Text::Unaccent::PurePerl.
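To make the last question concrete, here is a minimal sketch of the two forms side by side. It assumes only the core Encode module, and the variable names are illustrative. It uses escapes rather than a literal so it does not depend on the source file's encoding, and it is not meant as the definitive fix:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(decode encode);

# Form #1: a single character with code point 255 (the 255 row above).
my $chars = "Queensr\x{ff}che";

# Form #2: the two raw UTF-8 bytes 195 and 191 (the 195/191 rows above).
my $bytes = "Queensr\xc3\xbfche";

# Decoding the byte string gives a character string equal to form #1.
my $decoded = decode('UTF-8', $bytes);

printf "lengths: %d %d %d\n", length($chars), length($bytes), length($decoded);
print "decoded form matches form #1\n" if $decoded eq $chars;

# Encoding explicitly on the way out turns the character back into
# the bytes 195 191 that a UTF-8 terminal expects.
print encode('UTF-8', "$decoded\n");
```

If both strings are decoded once on input and encoded once on output, everything in between (including an unaccenting step) would be working on the same character string.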
Update #1:
--------------------------------------
Taking out the "use utf8" and "no utf8":
    #use utf8;
    my $string = "Queensrÿche";
    #no utf8;

And then running it again:
    81      Q       Q
    117     u       u
    101     e       e
    101     e       e
    110     n       n
    115     s       s
    114     r       r
    195
    191
    99      c       c
    104     h       h
    101     e       e
     - this is NOT utf8
    unaccented: QueensrA
    Queensrÿche

This confuses me even more. I understand the utf8 flag is not set now, so Encode doesn't see it as utf8. But I see the two utf-8 bytes for the "ÿ" are there (195 191) instead of the 255 I get with "use utf8". It prints correctly (and displays in my terminal properly) but does not unaccent correctly. Much confusion.
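The difference between the runs can be narrowed down to which bytes actually reach the terminal. The following is a rough sketch using core Perl only. It prints to in-memory filehandles so the raw bytes can be inspected, and the helper name bytes_written is hypothetical:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: print $text to an in-memory file through the given
# PerlIO layer and return the decimal values of the bytes that were written.
sub bytes_written {
    my ($text, $layer) = @_;
    my $buffer = '';
    open(my $fh, ">$layer", \$buffer) or die "cannot open in-memory file: $!";
    print {$fh} $text;
    close($fh);
    return join ' ', map { ord($_) } split //, $buffer;
}

my $char  = "\x{ff}";      # one character, as in the run with "use utf8"
my $bytes = "\xc3\xbf";    # two bytes, as in the run without "use utf8"

print "character, no layer:    ", bytes_written($char,  ''), "\n";
print "character, UTF-8 layer: ", bytes_written($char,  ':encoding(UTF-8)'), "\n";
print "bytes, no layer:        ", bytes_written($bytes, ''), "\n";
```

Under these assumptions the first case writes the lone byte 255, which is not a valid UTF-8 sequence, while the other two write 195 191.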