It seems I have read every page on character encoding I can find, but I am still missing something. I hope the remaining confusion can be cleared up here.
#!/usr/bin/perl
use strict;
use Encode;
use Text::Unaccent::PurePerl qw(unac_string);
use utf8;
my $string = "Queensrÿche";
no utf8;
chars($string);
(Encode::is_utf8($string))? print " - this is utf8\n" : print " - this is NOT utf8\n";
print "unaccented: " . Text::Unaccent::PurePerl::unac_string($string) . "\n";
print $string;
exit;
sub chars {
    my $k = shift;
    my @chars = split("",$k);
    foreach (@chars) {
        my $dec = ord($_);
        my $chr = chr(ord($_));
        my $escape = qquote($_);
        print "\t$dec\t$chr\t$escape\n";
    }
}
sub qquote {
    local($_) = shift;
    s/([\\\"\@\$])/\\$1/g;
    my $bytes; { use bytes; $bytes = length }
    s/([[:^ascii:]])/'\x{'.sprintf("%x",ord($1)).'}'/ge if $bytes > length;
    return $_;
}
This is what I am seeing in my terminal (I am using SecureCRT, with Terminal > Appearance > Character encoding: UTF-8):
81 Q Q
117 u u
101 e e
101 e e
110 n n
115 s s
114 r r
255 {ff}
99 c c
104 h h
101 e e
- this is utf8
unaccented: Queensryche
Queensr
Here are my questions about this:
- Why is the "ÿ" not printing correctly here in my terminal?
- ord() returns 255 for ÿ, a single byte. Encode thinks this is UTF-8, but isn't this actually UTF-16?
UTF-16 table: http://asecuritysite.com/coding/asc2
I have another version of "Queensrÿche" (in a JSON file). When I parse that and run it through the same thing, this is what I get:
81 Q Q
117 u u
101 e e
101 e e
110 n n
115 s s
114 r r
195 {c3}
191 {bf}
99 c c
104 h h
101 e e
- this is utf8
unaccented: QueensrA
Queensrÿche
This is where the deep confusion is for me.
- This actually looks like valid UTF-8 to me and Encode agrees. Is that correct?
ord() returns two bytes for ÿ, 195 and 191, which matches this table:
UTF-8 table: http://www.utf8-chartable.de/unicode-utf8-table.pl?utf8=dec
- Text::Unaccent::PurePerl does not "unaccent" it properly. Why not?
Finally, since these two strings cannot be compared and matched as being the same (I understand why), I need to "normalize" them.
- Is #1 Queensrÿche or #2 Queensrÿche actually UTF-8? (Or are they both actually UTF-8, as Encode believes?)
- Is there a way to safely convert them to the same encoding? I would like to preserve the ÿ, but I would also like to be able to use Text::Unaccent::PurePerl.
Thanks!
Update #1:
--------------------------------------
Taking out the "use utf8" and "no utf8":
#use utf8;
my $string = "Queensrÿche";
#no utf8;
And then running it again:
81 Q Q
117 u u
101 e e
101 e e
110 n n
115 s s
114 r r
195
191
99 c c
104 h h
101 e e
- this is NOT utf8
unaccented: QueensrA
Queensrÿche
This confuses me even more. I understand the utf8 flag is not set now, so Encode doesn't see it as UTF-8. But I see the two UTF-8 bytes for the "ÿ" are there (195 191) instead of the 255 I get with "use utf8". It prints correctly (and displays in my terminal properly) but does not unaccent correctly. Much confusion.
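For reference, here is a minimal sketch of how the two versions might be brought to the same representation. It assumes the JSON version holds the raw UTF-8 bytes (195 191) and the "use utf8" version holds the single character 255. The variable names are illustrative and not taken from the script above, and Text::Unaccent::PurePerl is left out so the sketch only needs core modules.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(decode);

# Assumption: this mirrors the string built under "use utf8" (one character, code point 255).
my $from_source = "Queensr\x{ff}che";

# Assumption: this mirrors the JSON version (two bytes, 0xC3 0xBF, for the same letter).
my $from_json = "Queensr\xc3\xbfche";

# Decode the UTF-8 bytes into Perl characters so both strings count the letter once.
my $decoded = decode('UTF-8', $from_json);

print "bytes before decode: ", length($from_json), "\n";   # 12
print "chars after decode:  ", length($decoded), "\n";     # 11
print $decoded eq $from_source ? "match\n" : "no match\n"; # match

# Encode on the way out so a UTF-8 terminal receives 0xC3 0xBF rather than a lone 0xFF byte.
binmode(STDOUT, ':encoding(UTF-8)');
print "$decoded\n";
```

If that assumption holds, the decoded string should presumably behave like version #1 when passed to unac_string, and the output layer on STDOUT would explain why the single 255 byte did not display in the terminal.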