
It's tricky, isn't it.

Two things make this tricky. First, it can be difficult to see where bytes and characters are being encoded/decoded. Second, Perl handles "old-fashioned" strings of bytes as well as "new-fangled" wide characters.

NB: the following applies to Perl v5.8.8 or later. There is, apparently, an EBCDIC Perl, which I know nothing about.

The actors in this drama are:

Perl's string handling -- including encode/decode.
the Perl IO Layers.
the OS and display devices -- ie everything else !

The following attempts to show, one step at a time, how the input, output and handling of wide character strings can be achieved, starting with byte strings and working up.

I confess this turned out to be a lot longer than I had expected/intended. I hope somebody will find it useful.

Starting at the top, Perl handles two forms of string. The first form is, fundamentally, an array of unsigned char (in C terms) -- byte form. Where you ask Perl to interpret these as characters, it will assume at least ASCII (eg uc($s)) -- possibly more if you use deeper magic. The second form is, in effect, an array of wide character ordinals in the range 0..(2^32)-1 (roughly speaking). When you ask Perl to interpret these as characters, it will assume Unicode. Attached to every string is a "mode-bit", telling Perl whether it contains bytes or wide characters.

The following will illustrate some of what is going on as Perl, Perl IO and the OS conspire together:

  use strict ;
  use warnings ;

  my $UT8 = 0 ;   # Option to "use utf8" -- may be commented out, below
  my $ECO = 0 ;   # Option to set ":encoding(utf8)" on STDOUT
  my $ECI = 0 ;   # Option to set ":encoding(utf8)" on STDIN
  my $UPG = 0 ;   # Option to "utf8::upgrade($s)" in show()
  my $UC  = 0 ;   # Option to "uc($s)" in show()

  foreach (@ARGV) {
    if    ($_ eq 'eco') { $ECO = 1 ; }
    elsif ($_ eq 'eci') { $ECI = 1 ; }
    elsif ($_ eq 'upg') { $UPG = 1 ; }
    elsif ($_ eq 'uc')  { $UC  = 1 ; }
    else  { die "$_ not known" ;     } ;
  } ;

  #use utf8 ;  $UT8 = 1 ;

  if ($ECO) { binmode STDOUT, ":encoding(utf8)" ; } ;
  if ($ECI) { binmode STDIN , ":encoding(utf8)" ; } ;

  my $m = '' ;
  if ($UT8)  { $m .= " use utf8 ;" ;                } ;
  if ($ECO)  { $m .= " STDOUT :encoding(utf8) ;" ;  } ;
  if ($ECI)  { $m .= " STDIN :encoding(utf8) ;" ;   } ;
  if ($UPG)  { $m .= " utf8::upgrade(\$s) ;" ;      } ;
  if ($UC)   { $m .= " uc(\$s) ;" ;                 } ;
  if ($m)    { print "Options:$m\n" ;               } ;

  show(1, "Hello World") ;
  show(2, "Hello W\xF6rld") ;
  show(3, "Hello W\x{14D}rld") ;
  show(4, "Hello Wörld") ;        # ord('ö') is 0xF6
  show(5, "Hello Wōrld") ;        # ord('ō') is 0x14D

  my $n = 'a' ;
  while (my $s = <STDIN>) {
    chomp($s) ;
    show($n++, $s) ;
  } ;

  sub show {
    my ($n, $s) = @_ ;
    if ($UPG) { utf8::upgrade($s) ; } ;
    if ($UC)  { $s = uc($s) ;       } ;
    print " $n: '$s' is ", utf8::is_utf8($s) ? "'wide'" : "'byte'",
            " len=", length($s), " \"", peek($s). "\"\n" ;
  } ;

  sub peek {      # Peek at the byte contents of the given string
    my ($s) = @_ ;

    use bytes ;   # Forces the unpack to show the byte contents of $s

    return join ('', map { ($_ >= 0x20) && ($_ < 0x7F) ? chr($_) : sprintf('\\x%02X', $_)
                         } unpack('C*', $s)) ;
  } ;

and the file read via STDIN is

  Hello World
  Hello Wörld
  Hello Wōrld

With all the options off, on my machine the code above gave:

 1: 'Hello World' is 'byte' len=11 "Hello World"
 2: 'Hello W▒rld' is 'byte' len=11 "Hello W\xF6rld"
Wide character in print at x.pl line 47.
 3: 'Hello Wōrld' is 'wide' len=11 "Hello W\xC5\x8Drld"
 4: 'Hello Wörld' is 'byte' len=12 "Hello W\xC3\xB6rld"
 5: 'Hello Wōrld' is 'byte' len=12 "Hello W\xC5\x8Drld"
 a: 'Hello World' is 'byte' len=11 "Hello World"
 b: 'Hello Wörld' is 'byte' len=12 "Hello W\xC3\xB6rld"
 c: 'Hello Wōrld' is 'byte' len=12 "Hello W\xC5\x8Drld"

YMMV. This shows a number of things:

that literal strings are byte strings, unless a character with ordinal > 0xFF forces otherwise.

(There does not appear to be a way to declare that a given literal string should be held as a "wide" string.)
if we looked at the source with our favourite hex editor, we'd see that the characters 'ö' and 'ō' appear in the file as the UTF-8 sequences "\xC3\xB6" and "\xC5\x8D" respectively. Perl happily accepts those byte values, and the string's mode, length and contents reflect that.
the string that is "wide" is actually held with characters > 0x7F encoded as UTF-8. This is key: "wide" character strings are actually held in UTF-8 encoded form.

What I'm calling the "wide" mode is known to Perl as "utf8". This is why.
the three lines (a to c) read from STDIN are all, by default, also byte strings.
when printing these strings, Perl is simply sending the byte contents to the OS. If the output is sent to some device, and the device expects UTF-8, then we will see what we expect (otherwise, not). (In fact, the Perl IO layer is "downgrading" the wide string, but that's covered below.)

I was lucky. String "2" contains a \xF6 byte, which is not valid UTF-8, so is rendered as a "splodge" (▒), on my machine.
you do not need to use utf8 to make the various utf8::xxxx() functions available. In fact, you must not use utf8 for that purpose -- because use utf8 means something else, see below.
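For anyone who would rather not lean on use bytes to see this, here is a minimal sketch of another way to look at the UTF-8 form of a wide string. The variable names are mine, and it assumes only the built-in utf8::encode(), which converts a copy of the string into its UTF-8 octets:

```perl
use strict;
use warnings;

my $wide = "W\x{14D}rld";          # one character above 0xFF forces wide form

# utf8::encode() works in place, so operate on a copy.
my $octets = $wide;
utf8::encode($octets);             # characters -> UTF-8 octets, now a byte string

my $hex = join ' ', map { sprintf '%02X', ord($_) } split //, $octets;
printf "%d characters, %d octets: %s\n", length($wide), length($octets), $hex;
```

This should print "5 characters, 6 octets: 57 C5 8D 72 6C 64", ie the same "\xC5\x8D" pair that peek() shows for string (3).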

Before we start to worry about encoding and decoding and other magic, let's see how a character function works with what we have so far. Turning on the "uc($s)" option gives:

Options: uc($s) ;
 1: 'HELLO WORLD' is 'byte' len=11 "HELLO WORLD"
 2: 'HELLO W▒RLD' is 'byte' len=11 "HELLO W\xF6RLD"
Wide character in print at x.pl line 47.
 3: 'HELLO WŌRLD' is 'wide' len=11 "HELLO W\xC5\x8CRLD"
 4: 'HELLO WöRLD' is 'byte' len=12 "HELLO W\xC3\xB6RLD"
 5: 'HELLO WōRLD' is 'byte' len=12 "HELLO W\xC5\x8DRLD"
 a: 'HELLO WORLD' is 'byte' len=11 "HELLO WORLD"
 b: 'HELLO WöRLD' is 'byte' len=12 "HELLO W\xC3\xB6RLD"
 c: 'HELLO WōRLD' is 'byte' len=12 "HELLO W\xC5\x8DRLD"

showing that uc() is only interested in ASCII (on my machine, anyway) in the byte mode strings, but has done a wonderful Unicode job on the wide string.
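Boiled down to single characters, the same effect can be sketched as follows. This assumes, as in the runs above, that neither use locale nor the unicode_strings feature of later Perls is in effect. The variable names are mine:

```perl
use strict;
use warnings;

my $byte = "\xF6";       # 'ö' as a single byte, byte form
my $wide = "\x{14D}";    # 'ō', wide form

my $byte_uc = uc($byte); # byte form: only ASCII letters are upshifted
my $wide_uc = uc($wide); # wide form: Unicode rules apply

printf "byte: %02X -> %02X\n", ord($byte), ord($byte_uc);
printf "wide: %X -> %X\n",     ord($wide), ord($wide_uc);
```

Under those assumptions the byte form character comes back unchanged (F6 -> F6) while the wide one is upshifted (14D -> 14C).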

Now, how do we convert these strings from byte to wide mode, so that the contents will be treated as Unicode characters ? One way is to use utf8::upgrade($s), which gives:

Options: utf8::upgrade($s) ; uc($s) ;
 1: 'HELLO WORLD' is 'wide' len=11 "HELLO WORLD"
 2: 'HELLO W▒RLD' is 'wide' len=11 "HELLO W\xC3\x96RLD"
Wide character in print at x.pl line 47.
 3: 'HELLO WŌRLD' is 'wide' len=11 "HELLO W\xC5\x8CRLD"
 4: 'HELLO WöRLD' is 'wide' len=12 "HELLO W\xC3\x83\xC2\xB6RLD"
 5: 'HELLO WōRLD' is 'wide' len=12 "HELLO W\xC3\x85\xC2\x8DRLD"
 a: 'HELLO WORLD' is 'wide' len=11 "HELLO WORLD"
 b: 'HELLO WöRLD' is 'wide' len=12 "HELLO W\xC3\x83\xC2\xB6RLD"
 c: 'HELLO WōRLD' is 'wide' len=12 "HELLO W\xC3\x85\xC2\x8DRLD"

which is really weird and unpleasant, but it does illustrate one piece of magic, concerning characters "\x80".."\xFF". In a byte mode string these characters appear exactly so. In a wide mode string these characters appear in their UTF-8 encoding, "\xC2\x80".."\xC3\xBF". Both forms, however, map to the same range of character ordinals, \x80..\xFF. So, Perl maps between the character encodings. Hence:

utf8::upgrade($s) when given a byte string, takes all byte values 0x80..0xFF and replaces them by the equivalent two byte UTF-8 sequence, and then sets the string to be "wide". utf8::upgrade($s) does nothing when given a string which is already "wide".

For string (2) this has translated "\xF6" to "\xC3\xB6" which uc($s) recognises, and upshifts to "\xC3\x96". This makes perfect sense if the string was in the Latin-1 character set -- so character 'ö' has been upshifted to 'Ö'. When printed it still shows as a "splodge", but see below.

Strings (4) & (5) and lines (b) & (c) from STDIN actually contain UTF-8 sequences, but utf8::upgrade($s) doesn't know that, and it translates each byte to its equivalent UTF-8 sequence. Not quite what one had in mind ! It so happens that the result is either already uppercase, or has no uppercase.
print is still expecting to output bytes. When given a wide mode string, it is happy to take UTF-8 sequences and translate them back to single bytes, except where the UTF-8 sequence gives an ordinal > 0xFF.

This is why string (2) still shows a "splodge". The IO Layers see "\xC3\x96" in a wide string, and translate that back down to the single byte "\xD6", which isn't a valid UTF-8 sequence, so the device shows "splodge".

With string (3), print sees "\xC5\x8C" which cannot be translated to a single byte, so we get a warning message and the bytes "\xC5\x8C" are output unchanged.

With string (4), print sees "\xC3\x83\xC2\xB6" which translate back to "\xC3\xB6", which is what we started with ! Similarly string (5) and lines (b) & (c).

The message is: as far as Perl is concerned byte string characters "\x80".."\xFF" are interchangeable with wide string characters with UTF-8 sequences "\xC2\x80".."\xC3\xBF". The utf8::upgrade() and utf8::downgrade() functions do this. It also happens when Perl implicitly forces a string to wide or to byte -- as we've seen print do.
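A minimal sketch of that round trip, using only utf8::upgrade(), utf8::downgrade() and utf8::is_utf8(). The variable names are mine. The second argument to utf8::downgrade() asks it to return false rather than die when the string cannot be downgraded:

```perl
use strict;
use warnings;

my $s        = "W\xF6rld";     # byte form
my $original = $s;

utf8::upgrade($s);             # same characters, now held as UTF-8 internally
my $is_wide_after_upgrade = utf8::is_utf8($s) ? 1 : 0;
my $same_after_upgrade    = ($s eq $original) ? 1 : 0;

utf8::downgrade($s);           # back to one byte per character
my $is_wide_after_downgrade = utf8::is_utf8($s) ? 1 : 0;

# A character above 0xFF has no single-byte form, so downgrade cannot succeed.
my $too_wide      = "W\x{14D}rld";
my $downgraded_ok = utf8::downgrade($too_wide, 1) ? 1 : 0;

print "upgrade: wide=$is_wide_after_upgrade same=$same_after_upgrade length=", length($s), "\n";
print "downgrade: wide=$is_wide_after_downgrade\n";
print "downgrade of U+014D succeeded: $downgraded_ok\n";
```

The point is that upgrading and downgrading change how the string is held, not which characters it contains: the length stays at 5 and eq still sees the same string.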

So, now we look at what use utf8 does. If we turn that on, and turn off the other options, the code gives:

Options: use utf8 ;
 1: 'Hello World' is 'byte' len=11 "Hello World"
 2: 'Hello W▒rld' is 'byte' len=11 "Hello W\xF6rld"
Wide character in print at x.pl line 47.
 3: 'Hello Wōrld' is 'wide' len=11 "Hello W\xC5\x8Drld"
 4: 'Hello W▒rld' is 'wide' len=11 "Hello W\xC3\xB6rld"
Wide character in print at x.pl line 47.
 5: 'Hello Wōrld' is 'wide' len=11 "Hello W\xC5\x8Drld"
 a: 'Hello World' is 'byte' len=11 "Hello World"
 b: 'Hello Wörld' is 'byte' len=12 "Hello W\xC3\xB6rld"
 c: 'Hello Wōrld' is 'byte' len=12 "Hello W\xC5\x8Drld"

we can see an improvement.

What use utf8 does is to tell Perl to expect the source to be in UTF-8 form, and in particular to interpret UTF-8 sequences in literal strings. As shown above, strings (4) and (5) are now wide mode.
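Since use utf8 is lexically scoped, the difference can be sketched within a single file. This assumes the file itself is saved as UTF-8, so that 'ö' is the two bytes "\xC3\xB6" on disk. The variable names are mine:

```perl
use strict;
use warnings;

my ($without, $with);
{
    no utf8;            # the default: the literal is taken byte by byte
    $without = "ö";
}
{
    use utf8;           # the literal is decoded from UTF-8 into characters
    $with = "ö";
}
printf "without: length=%d  with: length=%d\n", length($without), length($with);
```

Given a UTF-8 source file, this should report length 2 without the pragma and length 1 with it.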

So far, so good. To sort out the printing we have to tell Perl to encode stuff as UTF-8, and we can do that on a per filehandle basis. Turning on the "STDOUT :encoding(utf8)" option, the code gives:

Options: use utf8 ; STDOUT :encoding(utf8) ;
 1: 'Hello World' is 'byte' len=11 "Hello World"
 2: 'Hello Wörld' is 'byte' len=11 "Hello W\xF6rld"
 3: 'Hello Wōrld' is 'wide' len=11 "Hello W\xC5\x8Drld"
 4: 'Hello Wörld' is 'wide' len=11 "Hello W\xC3\xB6rld"
 5: 'Hello Wōrld' is 'wide' len=11 "Hello W\xC5\x8Drld"
 a: 'Hello World' is 'byte' len=11 "Hello World"
 b: 'Hello WÃ¶rld' is 'byte' len=12 "Hello W\xC3\xB6rld"
 c: 'Hello WÅ�rld' is 'byte' len=12 "Hello W\xC5\x8Drld"

Nearly there: most characters are showing as desired -- no more "splodges" -- and if you examine the output you will see that we're getting UTF-8 everywhere. But the lines input from STDIN now look odd.

Note especially string (2). This is in byte form. When printed with ':encoding(utf8)', byte strings are implicitly "upgraded" to UTF-8 -- remembering that this is implicitly treating the byte values as being in Latin-1 character set.

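The lines (b) & (c) are also in byte form, and those two are implicitly "upgraded" to UTF-8 as well. But their bytes were already UTF-8 sequences, so each byte of each sequence is encoded a second time, which is why they look odd.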
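The double encoding can be reproduced without any filehandles. The sketch below uses encode() from the core Encode module as a rough stand-in for what the output layer does to a byte string. That equivalence is my assumption for illustration, not a statement about the layer's internals:

```perl
use strict;
use warnings;
use Encode qw(encode);

my $octets = "W\xC3\xB6rld";    # "Wörld" already encoded as UTF-8, held in byte form

# Each byte is treated as a Latin-1 character and encoded again.
my $double = encode('UTF-8', $octets);

my $hex = join ' ', map { sprintf '%02X', ord($_) } split //, $double;
print "$hex\n";
```

This should print "57 C3 83 C2 B6 72 6C 64": the original "\xC3\xB6" has become "\xC3\x83\xC2\xB6", which a UTF-8 display renders as 'Ã¶'.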

Turning on the "STDIN :encoding(utf8)" option, the code gives:

Options: use utf8 ; STDOUT :encoding(utf8) ; STDIN :encoding(utf8) ;
 1: 'Hello World' is 'byte' len=11 "Hello World"
 2: 'Hello Wörld' is 'byte' len=11 "Hello W\xF6rld"
 3: 'Hello Wōrld' is 'wide' len=11 "Hello W\xC5\x8Drld"
 4: 'Hello Wörld' is 'wide' len=11 "Hello W\xC3\xB6rld"
 5: 'Hello Wōrld' is 'wide' len=11 "Hello W\xC5\x8Drld"
 a: 'Hello World' is 'wide' len=11 "Hello World"
 b: 'Hello Wörld' is 'wide' len=11 "Hello W\xC3\xB6rld"
 c: 'Hello Wōrld' is 'wide' len=11 "Hello W\xC5\x8Drld"

At last ! If we tell Perl that literal strings contain UTF-8 (use utf8), that the input is UTF-8 encoded (:encoding(utf8)) and the output should also be UTF-8 encoded -- then, surprise (!), we appear to get what we want.
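The same "decode on the way in, encode on the way out" arrangement can be sketched in a self-contained way. Here in-memory filehandles (open on a scalar reference) stand in for STDIN and STDOUT purely so the example needs no external file. That substitution, and the variable names, are mine:

```perl
use strict;
use warnings;

# Stand-in for a UTF-8 encoded input file: "Hello Wörld\n" as raw octets.
my $input_octets = "Hello W\xC3\xB6rld\n";

# Source: decode UTF-8 octets into characters as they are read.
open my $in, '<:encoding(UTF-8)', \$input_octets or die "open in: $!";
my $line = <$in>;
close $in;
chomp $line;

# Sink: encode characters as UTF-8 octets as they are written.
my $output_octets = '';
open my $out, '>:encoding(UTF-8)', \$output_octets or die "open out: $!";
print {$out} uc($line), "\n";
close $out;

printf "read %d characters, wrote %d octets\n", length($line), length($output_octets);
```

This should report 11 characters read and 13 octets written: the 'Ö' produced by uc() goes out as the two bytes "\xC3\x96", plus the newline.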

Are we there yet ? Not quite. If we now try our uc($s) option, we get:

  
Options: use utf8 ; STDOUT :encoding(utf8) ; STDIN :encoding(utf8) ; uc($s) ;
 1: 'HELLO WORLD' is 'byte' len=11 "HELLO WORLD"
 2: 'HELLO WöRLD' is 'byte' len=11 "HELLO W\xF6RLD"
 3: 'HELLO WŌRLD' is 'wide' len=11 "HELLO W\xC5\x8CRLD"
 4: 'HELLO WÖRLD' is 'wide' len=11 "HELLO W\xC3\x96RLD"
 5: 'HELLO WŌRLD' is 'wide' len=11 "HELLO W\xC5\x8CRLD"
 a: 'HELLO WORLD' is 'wide' len=11 "HELLO WORLD"
 b: 'HELLO WÖRLD' is 'wide' len=11 "HELLO W\xC3\x96RLD"
 c: 'HELLO WŌRLD' is 'wide' len=11 "HELLO W\xC5\x8CRLD"

...everything is fine, except for the byte form string -- which came from the literal with the "\xF6" escape. We still need the "utf8::upgrade($s)" option, and so:

 
Options: use utf8 ; STDOUT :encoding(utf8) ; STDIN :encoding(utf8) ; utf8::upgrade($s) ; uc($s) ;
 1: 'HELLO WORLD' is 'wide' len=11 "HELLO WORLD"
 2: 'HELLO WÖRLD' is 'wide' len=11 "HELLO W\xC3\x96RLD"
 3: 'HELLO WŌRLD' is 'wide' len=11 "HELLO W\xC5\x8CRLD"
 4: 'HELLO WÖRLD' is 'wide' len=11 "HELLO W\xC3\x96RLD"
 5: 'HELLO WŌRLD' is 'wide' len=11 "HELLO W\xC5\x8CRLD"
 a: 'HELLO WORLD' is 'wide' len=11 "HELLO WORLD"
 b: 'HELLO WÖRLD' is 'wide' len=11 "HELLO W\xC3\x96RLD"
 c: 'HELLO WŌRLD' is 'wide' len=11 "HELLO W\xC5\x8CRLD"

and at last we've succeeded in:

reading UTF-8 encoded strings, both from literals and from a file.
outputting UTF-8 encoded strings.
getting a representative character function to work on all these different strings.

Literal strings such as (2) above are a problem. If at some point they are not "upgraded", then they will not operate as intended. Some things (eg print) will implicitly "upgrade" byte strings. If a wide string and a byte string are processed together, the byte string will be implicitly upgraded. At other times a byte string will be processed as is. You may choose to always "upgrade" such strings as soon as they are assigned, or always "upgrade" everything before running some wide character operation on it.
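As a small sketch of the "processed together" case, concatenating a byte string with a wide string gives a wide result in which the byte string's characters keep their ordinals. The variable names are mine:

```perl
use strict;
use warnings;

my $byte = "W\xF6rld";         # byte form
my $wide = "W\x{14D}rld";      # wide form

my $joined = $byte . ' ' . $wide;

printf "%s, length=%d, ord of 2nd char=0x%X\n",
    (utf8::is_utf8($joined) ? 'wide' : 'byte'),
    length($joined),
    ord(substr($joined, 1, 1));
```

This should print "wide, length=11, ord of 2nd char=0xF6": the "\xF6" byte has been carried into the wide string as character 0xF6, ie it has been treated as Latin-1.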

The following may, or may not, appeal:

  sub qu ($) { utf8::upgrade(my $s = $_[0]) ; return $s ; } ;

  show(6, qu "Hello W\xF6rld") ;

This area is complicated. Partly because wide character handling is inherently complicated, and generally unfamiliar. Also partly because Perl has to avoid breaking (too much) stuff which depends on the old, familiar byte string handling.

In the above I have tried to show how the various parts hang together and which part does what. The conclusion is that if you ensure that all sources and sinks of strings are correctly set to expect UTF-8, then things are pretty straightforward. Along the way, however, I have tried to show why all those are necessary.

For more on how to set filehandles to handle UTF-8 (and other) encodings, see open and binmode.

In reply to Re^3: function length() in UTF-8 context by gone2015
in thread function length() in UTF-8 context by didess
