The Queensrÿche Situation

Rodster001 has asked for the wisdom of the Perl Monks concerning the following question:

It seems I have read every page on character encoding I can find. But I am missing something. I still have a bit of confusion which I hope can get cleared up here.

#!/usr/bin/perl

use strict;
use Encode;
use Text::Unaccent::PurePerl qw(unac_string);

use utf8;
my $string = "Queensrÿche";
no utf8;

chars($string);

(Encode::is_utf8($string))? print " - this is utf8\n" : print " - this is NOT utf8\n";

print "unaccented: " . Text::Unaccent::PurePerl::unac_string($string) . "\n";

print $string;

exit;


sub chars {

        my $k = shift;

        my @chars = split("",$k);

        foreach (@chars) {

                my $dec = ord($_);
                my $chr = chr(ord($_));
                my $escape = qquote($_);
                print "\t$dec\t$chr\t$escape\n";
        }
}


sub qquote {

        local($_) = shift;

        s/([\\\"\@\$])/\\$1/g;

        my $bytes; { use bytes; $bytes = length }

        s/([[:^ascii:]])/'\x{'.sprintf("%x",ord($1)).'}'/ge if $bytes > length;

        return $_;
}

This is what I am seeing in my terminal (I am using secure crt, with Terminal > Appearance > Character encoding: UTF-8)

        81      Q       Q
        117     u       u
        101     e       e
        101     e       e
        110     n       n
        115     s       s
        114     r       r
        255     {ff}
        99      c       c
        104     h       h
        101     e       e
 - this is utf8
unaccented: Queensryche
Queensr

Here are my questions about this:

Why is the "ÿ" not printing correctly here in my terminal?
ord() returns 255 for ÿ, a single byte. Encode thinks this is utf-8, but isn't this actually utf-16?
Utf-16 table: http://asecuritysite.com/coding/asc2

I have another version of "Queensrÿche" (in a JSON file), when I parse that and run it though the same thing, this is what I get:

        81      Q       Q
        117     u       u
        101     e       e
        101     e       e
        110     n       n
        115     s       s
        114     r       r
        195     {c3}
        191     {bf}
        99      c       c
        104     h       h
        101     e       e
 - this is utf8
unaccented: QueensrA
Queensrÿche

This is where the deep confusion is for me.

This actually looks like valid UTF-8 to me and Encode agrees. Is that correct?
ord() returns two bytes for ÿ, 195 and 191 which matches up with this table:
Utf-8 table: http://www.utf8-chartable.de/unicode-utf8-table.pl?utf8=dec
Text::Unaccent::PurePerl does not "unaccent" it properly. Why not?

Finally. Since these two strings cannot be compared and matched as being the same (which I understand why) I need to "normalize" them.

Is #1 Queensrÿche or #2 Queensrÿche actually utf-8? (Or are they both actually utf-8 as Encode believes?)
Is there a way to safely convert them to the same encoding? I would like to preserve the ÿ but I would also like to be able to use Text::Unaccent::PurePerl.

Thanks!

Update #1:
--------------------------------------

Taking out the "use utf8" and "no utf8":

#use utf8;
my $string = "Queensrÿche";
#no utf8;

And then running it again:

        81      Q       Q
        117     u       u
        101     e       e
        101     e       e
        110     n       n
        115     s       s
        114     r       r
        195     
       191     
       99      c       c
        104     h       h
        101     e       e
 - this is NOT utf8
unaccented: QueensrA

                     Queensrÿche

This confuses me even more. I understand the utf8 flag is not set now, so Encode doesn't see it as utf8. But I see the two utf-8 bytes for the "ÿ" are there (195 191) instead of 255 when using "use utf8". It prints correctly (and displays in my terminal properly) but does not unaccent correctly. Much confusion.


Replies are listed 'Best First'.
Re: The Queensrÿche Situation by aitap (Curate) on Oct 19, 2014 at 19:00 UTC
You didn't use binmode to apply an I/O layer that encodes the Unicode characters you print to STDOUT, nor did you encode them manually. When Perl encounters characters where it expects bytes (in any I/O), it applies some heuristics to translate the former to the latter. Usually this means that whatever can be translated to Latin-1 gets translated (silently!) and everything else is printed as UTF-8 (with a warning):

    $ perl -w -Mutf8 -E'say "ы"; say "ÿ";'
    Wide character in say at -e line 1.
    ы
    �

(My terminal is UTF-8.)

And when you `use utf8`, Perl decodes UTF-8 byte string literals into characters for you. Encode::decode does the same.

Does adding `binmode STDOUT, ":utf8";` fix your problem? You can also use `:encoding(...)` I/O layers to encode into other encodings.
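The two options mentioned in this reply (encoding by hand, or letting an I/O layer do it) can be sketched as follows. This is only an illustration, not part of the original reply: the variable names are made up, and the ÿ is written as "\x{FF}" so the example does not depend on how the script file itself is saved.

```perl
use strict;
use warnings;
use Encode qw(encode);

# A character string: 11 characters, the 8th being U+00FF.
my $chars = "Queensr\x{FF}che";

# Encoding turns the characters into the UTF-8 bytes a UTF-8 terminal expects.
my $bytes = encode('UTF-8', $chars);

printf "characters: %d, UTF-8 bytes: %d\n", length($chars), length($bytes);

# Option 1: encode manually and print the bytes to a raw handle.
binmode STDOUT, ':raw';
print $bytes, "\n";

# Option 2: let an I/O layer do the encoding and print the characters.
binmode STDOUT, ':encoding(UTF-8)';
print $chars, "\n";
```

Both print statements should send the same bytes to the terminal. The layer is usually the more convenient choice because the rest of the program can then work purely with characters.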
Re^2: The Queensrÿche Situation by Rodster001 (Pilgrim) on Oct 19, 2014 at 19:24 UTC
Yes! That fixes the printing problem in my terminal. And this makes complete sense now. Thank you for clearing this up!

One problem remains that I still don't quite understand.

    #!/usr/bin/perl
    use strict;
    use Encode;
    use Text::Unaccent::PurePerl;

    binmode STDOUT, ":utf8";

    use utf8;
    my $string = "Queensrÿche";
    no utf8;

    chars($string);
    (Encode::is_utf8($string))? print "this is utf8\n" : print "this is NOT utf8\n";
    print "$string\n";
    print "unaccented: " . Text::Unaccent::PurePerl::unac_string($string) . "\n";
    exit;

    sub chars {
        my $k = shift;
        my @chars = split("",$k);
        foreach (@chars) {
            my $dec = ord($_);
            my $chr = chr(ord($_));
            my $q = qquote($_);
            print "\t$dec\t$chr\t$q\n";
        }
    }

    sub qquote {
        local($_) = shift;
        s/([\\\"\@\$])/\\$1/g;
        my $bytes; { use bytes; $bytes = length }
        s/([[:^ascii:]])/'\x{'.sprintf("%x",ord($1)).'}'/ge if $bytes > length;
        return $_;
    }

Why does that produce this:

        81      Q       Q
        117     u       u
        101     e       e
        101     e       e
        110     n       n
        115     s       s
        114     r       r
        255     ÿ       \x{ff}
        99      c       c
        104     h       h
        101     e       e
    this is utf8
    Queensrÿche
    unaccented: Queensryche

Is that actually valid utf-8? Shouldn't the ÿ be two bytes (decimal 195 191)? Like this:

        81      Q       Q
        117     u       u
        101     e       e
        101     e       e
        110     n       n
        115     s       s
        114     r       r
        195     -       \x{c3}
        191     -       \x{bf}
        99      c       c
        104     h       h
        101     e       e
Re^3: The Queensrÿche Situation by aitap (Curate) on Oct 19, 2014 at 20:01 UTC
When you work with Unicode, you should see character codes (which can be greater than 255), not byte sequences, because Perl encapsulates encodings for you. For example,

    use utf8;
    binmode STDOUT, ":utf8";
    my $string = "Queensrÿche ы";
    printf "%x\t%s\n", ord($_), $_ for split "", $string;
    __END__
    51      Q
    75      u
    65      e
    65      e
    6e      n
    73      s
    72      r
    ff      ÿ
    63      c
    68      h
    65      e
    20
    44b     ы

If you need to work with UTF-8 bytes, encode them back:

    use utf8;
    use Encode 'encode';
    binmode STDOUT, ":utf8";
    my $string = "Queensrÿche ы";
    printf "%x\t%s\n", ord($_), $_ for split "", encode utf8 => $string;
    __END__
    51      Q
    75      u
    65      e
    65      e
    6e      n
    73      s
    72      r
    c3      Ã
    bf      ¿
    63      c
    68      h
    65      e
    20
    d1      Ñ
    8b

But there would be no point in using utf8 and Encode in this case.
Re^4: The Queensrÿche Situation by Rodster001 (Pilgrim) on Oct 19, 2014 at 21:00 UTC
Re^3: The Queensrÿche Situation by karlgoethebier (Abbot) on Oct 19, 2014 at 21:02 UTC
"Yes! That fixes the printing problem in my terminal!"

That's nice. But just to add a little bit of confusion, please see this. A one-liner prints it out as expected:

    karl$ perl -e 'print qq(Queensrÿche\n)'
    Queensrÿche

But please see what happens when I put the stuff into a script (in the same terminal session):

    #!/usr/bin/env perl
    use strict;
    use warnings;
    binmode STDOUT, ":utf8";
    my $string = qq(Queensrÿche);
    print qq($string\n);
    my $y_with_trema = qq(\N{LATIN SMALL LETTER Y WITH DIAERESIS});
    print qq($y_with_trema\n);
    $string = qq(Queensr) . $y_with_trema . qq(che);
    print qq($string\n);
    __END__
    karls-mac-mini:monks karl$ ./roadster001.pl
    QueensrÃ¿che
    ÿ
    Queensrÿche

Seems like things are getting weird. I wonder when I will ever understand this crap.

N.B.: I came in a bit late and didn't read all the posts yet.

Best regards, Karl
«The Crux of the Biscuit is the Apostrophe»
Re^4: The Queensrÿche Situation by Rodster001 (Pilgrim) on Oct 19, 2014 at 21:38 UTC
Re^5: The Queensrÿche Situation by karlgoethebier (Abbot) on Oct 21, 2014 at 07:56 UTC
Re^3: The Queensrÿche Situation by Rodster001 (Pilgrim) on Oct 19, 2014 at 20:00 UTC
I figured it out, sort of. The first is actually ascii (255 maps to "ÿ"): http://www.ascii-code.com

So, when I take the string "Queensrÿche" (which IS actually encoded as utf-8) for example:

        Decimal Char    escaped
        81      Q       Q
        117     u       u
        101     e       e
        101     e       e
        110     n       n
        115     s       s
        114     r       r
        195     -       \x{c3}
        191     -       \x{bf}
        99      c       c
        104     h       h
        101     e       e

It is now printing on my terminal like this:

    QueensrÃ¿che

This makes sense, in a way, now because 195 maps to "Ã" and 191 maps to "¿". So, now my question is, why isn't this mapping using a utf-8 table (instead of ascii)? Encode thinks the string is utf-8 (which I assume means the utf-8 flag is on).
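The "Ã¿" output described in this post is what happens when UTF-8 bytes that were never decoded are pushed through a UTF-8 output layer: each byte is treated as a character of its own and encoded a second time. A minimal sketch of that effect, with illustrative variable names and the bytes written as hex escapes:

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# UTF-8 bytes as they arrive when nothing has decoded them.
my $undecoded = "Queensr\xC3\xBFche";

# Perl sees two separate characters here (U+00C3 and U+00BF), so a
# UTF-8 output layer encodes each of them on its own.
my $double_encoded = encode('UTF-8', $undecoded);

# Decoding first recovers the single character U+00FF.
my $decoded = decode('UTF-8', $undecoded);

printf "undecoded:      %d characters\n", length $undecoded;       # 12
printf "double-encoded: %d bytes\n",      length $double_encoded;  # 14
printf "decoded:        %d characters\n", length $decoded;         # 11
```

Under this reading, the fix is to decode the input once at the point where it enters the program, so that the output layer only ever encodes real characters.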
Re: The Queensrÿche Situation by Jim (Curate) on Oct 19, 2014 at 20:39 UTC
If you only have to deal with Unicode (and you properly should only have to deal with Unicode in this millennium), then use the Unicode collation algorithm instead of something non-standard. In Perl, this means using Unicode::Collate. Both the Unicode collation algorithm and the Perl CPAN module Unicode::Collate are customizable.

    use strict;
    use warnings;

    # This Perl script is Unicode UTF-8
    use utf8;

    # Proper Unicode collation
    use Unicode::Collate;

    # The output of this Perl script is Unicode UTF-8
    binmode STDOUT, ':encoding(UTF-8)';

    my $fancy = 'Queensrÿche';
    my $plain = 'Queensryche';

    my $collator = Unicode::Collate->new(
        level         => 1,
        normalization => undef,
    );

    # This prints "Queensrÿche and Queensryche are the same word."
    printf "$fancy and $plain %s the same word.\n",
        $collator->eq($fancy, $plain) ? "are" : "aren't";

    exit 0;

As it says in the script, this correctly prints "Queensrÿche and Queensryche are the same word." Whether or not this is exactly what's displayed in your terminal window is another matter altogether, one that's not related to the Perl script.

See Perl Unicode Cookbook: Case- and Accent-insensitive Comparison by Tom Christiansen (tchrist).

Update: By the way, in this same configuration of Unicode::Collate, the strings "QUEENSRŸCHE" and "Queensryche" will compare equal as well.
Re: The Queensrÿche Situation by ikegami (Patriarch) on Oct 19, 2014 at 22:49 UTC
> Why is the "ÿ" not printing correctly here in my terminal?

Your terminal expects UTF-8. You printed chr(0xFF), which is not the UTF-8 encoding of "ÿ". You can encode it yourself, or you can ask Perl to do it using the following:

    use open ':std', ':encoding(UTF-8)';

> ord() returns 255 for ÿ, a single byte. Encode thinks this is utf-8, but isn't this actually utf-16?

It's not UTF-8 (which would be `C3 BF`). `is_utf8($string)` does not indicate whether `$string` contains UTF-8.

It's not UTF-16 (which would be `00 FF` or `FF 00` depending on endianness).

Decoding a string (as `use utf8;` does for literals) results in Unicode Code Points ("ÿ" is U+00FF).

> This actually looks like valid UTF-8 to me and Encode agrees. Is that correct?

That is the UTF-8 encoding of "Queensrÿche", though it is incorrect to say that `is_utf8` signifies that Encode agrees.

> Text::Unaccent::PurePerl does not "unaccent" it properly. Why not?

Tools that work with text (such as regular expressions and Text::Unaccent::PurePerl) usually expect the text to be provided as strings of Unicode Code Points, not encoded using UTF-8.

> Is there a way to safely convert them to the same encoding?

The aforementioned

    use open ':std', ':encoding(UTF-8)';

will also tell Perl to decode bytes read from file handles.

    use utf8;
    use open ':std', ':encoding(UTF-8)';
    use JSON::XS qw( decode_json encode_json );

    my $s = "Queensrÿche";
    printf("U+%v04X %s\n", $s, $s);

    {
       # Uses encoding specified by "use open".
       open(my $fh, '>', 'foo.txt') or die $!;
       print($fh "$s\n");
    }

    {
       # Uses encoding specified by "use open".
       open(my $fh, '<', 'foo.txt') or die $!;
       chomp( my $got = <$fh> );
       printf("U+%v04X %s\n", $got, $got);
    }

    {
       # :raw overrides default encoding specified above
       # since encode_json already encodes using UTF-8
       open(my $fh, '>:raw', 'foo.json') or die $!;
       print($fh encode_json( { text => $s } ));
    }

    {
       my $json = do {
          # Similarly, decode_json expects UTF-8.
          open(my $fh, '<:raw', 'foo.json') or die $!;
          local $/;
          <$fh>
       };
       my $got = decode_json($json)->{text};
       printf("U+%v04X %s\n", $got, $got);
    }
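For the original question of bringing the two versions of the string into the same form, one possible sketch is to decode the byte version once and then work only with code points. The helper names `to_characters` and `strip_marks` are hypothetical, and `strip_marks` is a stand-in built on the core module Unicode::Normalize, not the Text::Unaccent::PurePerl implementation.

```perl
use strict;
use warnings;
use Encode qw(decode);
use Unicode::Normalize qw(NFD);

# Hypothetical helper: decode UTF-8 bytes into a string of code points.
sub to_characters {
    my ($bytes) = @_;
    return decode('UTF-8', $bytes);
}

# Hypothetical stand-in for unac_string: decompose each character,
# then drop the combining marks that the decomposition exposes.
sub strip_marks {
    my ($text) = @_;
    (my $stripped = NFD($text)) =~ s/\p{Mn}+//g;
    return $stripped;
}

my $from_source = "Queensr\x{FF}che";      # what "use utf8" yields for the literal
my $from_file   = "Queensr\xC3\xBFche";    # undecoded UTF-8 bytes, e.g. read raw

my $decoded = to_characters($from_file);

print "equal after decoding\n" if $decoded eq $from_source;
print strip_marks($decoded), "\n";         # Queensryche
```

Once both strings are code-point strings they compare equal with `eq`, and text-level tools see a single ÿ instead of two unrelated characters.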
Re^2: The Queensrÿche Situation by Rodster001 (Pilgrim) on Oct 19, 2014 at 23:16 UTC
Got it. So, "is_utf8" just tells us that the utf-8 flag is set?
Re^3: The Queensrÿche Situation by ikegami (Patriarch) on Oct 20, 2014 at 02:18 UTC
Exactly. It merely says which internal storage format is used. It's only useful for debugging XS modules, if at all. (Added plain text example to the program in my earlier post.)
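A small sketch of that point, using the built-in `utf8::upgrade` and `utf8::is_utf8` functions: the flag changes with the internal storage format while the string itself stays the same. The variable names are illustrative.

```perl
use strict;
use warnings;

my $downgraded = "Queensr\x{FF}che";   # stored internally as one byte per character
my $upgraded   = $downgraded;

utf8::upgrade($upgraded);              # switch internal storage; contents unchanged

printf "flag on downgraded copy: %s\n", utf8::is_utf8($downgraded) ? 'on' : 'off';
printf "flag on upgraded copy:   %s\n", utf8::is_utf8($upgraded)   ? 'on' : 'off';
printf "strings equal:           %s\n", $downgraded eq $upgraded   ? 'yes' : 'no';
printf "lengths:                 %d and %d\n", length $downgraded, length $upgraded;
```

Both copies hold the same 11 characters and compare equal, which is why the flag says nothing about whether a string "is UTF-8" in the sense the question intended.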
Re: The Queensrÿche Situation by LanX (Saint) on Oct 19, 2014 at 18:06 UTC
Many questions, but I'd be surprised if the default font of your terminal supported a fictitious° character like ÿ. See also Metal Umlaut! :)

Cheers Rolf
(addicted to the Perl Programming Language and ☆☆☆☆ :)

°) Well, maybe not fictitious, but very rare. But the Latin-1 code is 255, which answers another question.

Update: By the way, it's not an umlaut! In German it's a medieval handwriting ligature of ij, a diphthong still found in Dutch (see rijk); those sounds are written ei in modern German (see Reich). In French, trema accents are used to pronounce adjacent vowels separately (see Citroën or naïve). English imported some of them.
Re^2: The Queensrÿche Situation by Tux (Canon) on Oct 19, 2014 at 19:16 UTC
The Dutch ĳ is still regarded as a single syllable, but written as ij. Even in official documents the ĳ has been banned. I bet, however, that every Dutch person will have no trouble reading the ĳ when ij was meant and vice versa. I think that many of you won't even see the difference in your browser (unless of course ĳ is not represented in your font).

Enjoy, Have FUN! H.Merijn
Re^3: The Queensrÿche Situation by LanX (Saint) on Oct 19, 2014 at 19:53 UTC
A single letter, really? Interesting: IJ (digraph).

In standard German, single vowels are always monophthongs.

At least I know now where the Swiss canton of Schwyz got its y from :)

Cheers Rolf
(addicted to the Perl Programming Language and ☆☆☆☆ :)
Re^2: The Queensrÿche Situation by Rodster001 (Pilgrim) on Oct 19, 2014 at 18:13 UTC
Sorry for the confusing nature of this post. I suppose it really just comes down to this. Which of these are utf8?

        81      Q       Q
        117     u       u
        101     e       e
        101     e       e
        110     n       n
        115     s       s
        114     r       r
        195     {c3}
        191     {bf}
        99      c       c
        104     h       h
        101     e       e

        81      Q       Q
        117     u       u
        101     e       e
        101     e       e
        110     n       n
        115     s       s
        114     r       r
        255     {ff}
        99      c       c
        104     h       h
        101     e       e
Re^3: The Queensrÿche Situation by LanX (Saint) on Oct 19, 2014 at 18:21 UTC
As I can see from the German Wikipedia page, 255 (FF) is the Latin-1 code.

And Google is your friend: http://www.utf8-chartable.de/unicode-utf8-table.pl?utf8=0x&unicodeinhtml=hex

C3 BF is UTF-8.

Cheers Rolf
(addicted to the Perl Programming Language and ☆☆☆☆ :)
Re^4: The Queensrÿche Situation by Rodster001 (Pilgrim) on Oct 19, 2014 at 18:42 UTC
Re^5: The Queensrÿche Situation by Jim (Curate) on Oct 19, 2014 at 20:02 UTC
Re: The Queensrÿche Situation by Jim (Curate) on Oct 19, 2014 at 22:17 UTC
I highly recommend using these two companion applications when working with Unicode text as well as text in other vendor and national character sets (so-called "legacy" character encodings): BabelMap (Unicode character map for Windows) and BabelPad (Unicode text editor for Windows). They're both extraordinarily helpful when getting down 'n' dirty with Unicode.

