Re^4: The Queensrÿche Situation

Right. So #1 is utf-8. Then #2 is utf-16?

So then why does this:


use utf8;
my $string = "Queensrÿche";
no utf8;

Produce this:

        81      Q       Q
        117     u       u
        101     e       e
        101     e       e
        110     n       n
        115     s       s
        114     r       r
        255     {ff}
        99      c       c
        104     h       h
        101     e       e
 - this is utf8

When this:


#use utf8;
my $string = "Queensrÿche";
#no utf8;

Produces this:

        81      Q       Q
        117     u       u
        101     e       e
        101     e       e
        110     n       n
        115     s       s
        114     r       r
        195     
        191     
        99      c       c
        104     h       h
        101     e       e
 - this is NOT utf8

If the two bytes are "there", why does "use utf8" yield decimal 255 for the "ÿ", which is not valid UTF-8?

"The first 128 characters (US-ASCII) need one byte. The next 1,920 characters need two bytes to encode." - http://en.wikipedia.org/wiki/UTF-8


Replies are listed 'Best First'.

Re^5: The Queensrÿche Situation
by Jim (Curate) on Oct 19, 2014 at 20:02 UTC

Right. So #1 is utf-8. Then #2 is utf-16?

No, #2 is ISO-8859-1, which is also known as Latin 1. As it happens, it's also Windows-1252, which today is really a quasi-superset of ISO-8859-1. Neither ISO-8859-1 nor Windows-1252 is Unicode at all, so #2 is not in any Unicode character encoding scheme such as UTF-16.

The character encodings ISO-8859-1 (Latin 1) and Windows-1252 are often referred to as "legacy encodings," especially vis-à-vis Unicode.
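
A minimal sketch of that difference, assuming a Perl with the core Encode module. It is not code from this thread, and the variable names are only illustrative. The ÿ is written as the escape \x{FF} so the result does not depend on how the script file itself is saved. It prints each character's code point, then the bytes produced by encoding the same string as ISO-8859-1 and as UTF-8.

```perl
use strict;
use warnings;
use Encode qw(encode);

# "\x{FF}" is the character ÿ (U+00FF), written as an escape so the
# script behaves the same whatever encoding the file is saved in.
my $string = "Queensr\x{FF}che";

# ord() on a character string returns the code point of each character.
my @code_points = map { ord($_) } split //, $string;

# Encoding turns characters into bytes; unpack 'C*' lists the byte values.
my @latin1_bytes = unpack 'C*', encode('ISO-8859-1', $string);
my @utf8_bytes   = unpack 'C*', encode('UTF-8', $string);

print "code points: @code_points\n";
print "ISO-8859-1:  @latin1_bytes\n";
print "UTF-8:       @utf8_bytes\n";
```

The ISO-8859-1 bytes match the code points one for one (ÿ is the single byte 255), while UTF-8 spends two bytes, 195 and 191, on the same character.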
