in reply to Re: The Queensrÿche Situation
in thread The Queensrÿche Situation

Sorry for the confusing nature of this post. I suppose it really just comes down to this. Which of these are utf8?
#1: 81 Q Q 117 u u 101 e e 101 e e 110 n n 115 s s 114 r r 195 {c3} 191 {bf} 99 c c 104 h h 101 e e
#2: 81 Q Q 117 u u 101 e e 101 e e 110 n n 115 s s 114 r r 255 {ff} 99 c c 104 h h 101 e e
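One way to test the question directly is to try a strict UTF-8 decode on each sequence. This is only a sketch: it assumes the numbers in the dump are raw octets rather than code points, and `is_valid_utf8` is a helper name made up for this example, not something from the original script.

```perl
use strict;
use warnings;
use Encode qw(decode);

# The two sequences from the dump, written as raw octets (assumption:
# the dump shows bytes, not decoded characters).
my $seq1 = "Queensr\xC3\xBFche";   # 195 191 in the middle
my $seq2 = "Queensr\xFFche";       # 255 in the middle

# Hypothetical helper: true if $octets is well-formed UTF-8.
# 'UTF-8' (with the hyphen) is Encode's strict variant; FB_CROAK makes
# decode die on malformed input, LEAVE_SRC keeps $octets unmodified.
sub is_valid_utf8 {
    my ($octets) = @_;
    my $ok = eval {
        decode('UTF-8', $octets, Encode::FB_CROAK() | Encode::LEAVE_SRC());
        1;
    };
    return $ok ? 1 : 0;
}

printf "#1 is %s\n", is_valid_utf8($seq1) ? "valid UTF-8" : "not valid UTF-8";
printf "#2 is %s\n", is_valid_utf8($seq2) ? "valid UTF-8" : "not valid UTF-8";
```

Read as octets, only the first sequence decodes, because a lone 0xFF byte can never appear in well-formed UTF-8.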

Re^3: The Queensrÿche Situation
by LanX (Saint) on Oct 19, 2014 at 18:21 UTC
      Right. So #1 is utf-8. Then #2 is utf-16?

      So then why does this:

      use utf8; my $string = "Queensrÿche"; no utf8;
      Produce this:
      81 Q Q 117 u u 101 e e 101 e e 110 n n 115 s s 114 r r 255 {ff} 99 c c 104 h h 101 e e - this is utf8
      When this:
      #use utf8; my $string = "Queensrÿche"; #no utf8;
      Produces this:
      81 Q Q 117 u u 101 e e 101 e e 110 n n 115 s s 114 r r 195 191 99 c c 104 h h 101 e e - this is NOT utf8
      If the two bytes are "there", why is "use utf8" yielding a dec 255 for the "ÿ" which is not valid utf8?

      "The first 128 characters (US-ASCII) need one byte. The next 1,920 characters need two bytes to encode." - http://en.wikipedia.org/wiki/UTF-8
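A possible way to look at the two dumps side by side, offered as a sketch rather than a diagnosis of the original script: with `use utf8` Perl reads the literal as characters, so `ord` reports the code point (255 for U+00FF), while without it the literal stays as the octets stored in the file. The example assumes the source file is saved as UTF-8, and it uses `\x{FF}` plus `Encode` instead of a literal ÿ so that it does not depend on how this snippet itself is saved.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# What "use utf8" would yield for the literal: one character, U+00FF.
my $chars = "Queensr\x{FF}che";

# What a UTF-8 source file holds on disk, and what Perl sees without
# "use utf8": the same text as UTF-8 octets.
my $octets = encode('UTF-8', $chars);

my @code_points = map { ord($_) } split(//, $chars);
my @bytes       = map { ord($_) } split(//, $octets);

print "characters: @code_points\n";   # 255 at position 8
print "octets:     @bytes\n";         # 195 191 at positions 8 and 9
```

The 255 in the first list is a code point, not a byte, so it does not need to be valid UTF-8 on its own. It only becomes 195 191 when the string is encoded for output.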

        Right. So #1 is utf-8. Then #2 is utf-16?

        No, #2 is ISO-8859-1, also known as Latin 1. As it happens, it is also valid Windows-1252, which today is really a quasi-superset of ISO-8859-1. Neither ISO-8859-1 nor Windows-1252 is a Unicode encoding, so #2 is not in any Unicode character encoding scheme such as UTF-16.

        The character encodings ISO-8859-1 (Latin 1) and Windows-1252 are often referred to as "legacy encodings," especially vis-à-vis Unicode.
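The point about #2 can be checked with `Encode` as well. This is a minimal sketch, again assuming #2 is a string of raw octets. The encoding names `ISO-8859-1` and `cp1252` are the ones `Encode` ships with.

```perl
use strict;
use warnings;
use Encode qw(decode);

my $seq2 = "Queensr\xFFche";   # sequence #2 as raw octets (assumption)

# Decode the same octets under both legacy encodings.
my $latin1 = decode('ISO-8859-1', $seq2);
my $cp1252 = decode('cp1252',     $seq2);

# The eighth character (index 7) is the one in question.
printf "ISO-8859-1:   U+%04X\n", ord(substr($latin1, 7, 1));
printf "Windows-1252: U+%04X\n", ord(substr($cp1252, 7, 1));
print "both decodings give the same text\n" if $latin1 eq $cp1252;
```

Both decodings map the byte 0xFF to U+00FF (ÿ), which is why the same octets can be read as either encoding here.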