PerlMonks
It's tricky, isn't it? Two things make it so: first, it can be difficult to see where bytes and characters are being encoded/decoded; second, Perl handles "old-fashioned" strings of bytes as well as "new-fangled" wide characters.

NB: the following applies to Perl v5.8.8 or later. There is, apparently, an EBCDIC Perl, which I know nothing about.

The actors in this drama are:
The following attempts to show, one step at a time, how the input, output and handling of wide character strings can be achieved, starting with byte strings and working up. I confess this turned out to be a lot longer than I had expected/intended. I hope somebody will find it useful.

Starting at the top, Perl handles two forms of string. The first form is, fundamentally, an array of unsigned char (in C terms) -- byte form. Where you ask Perl to interpret these as characters, it will assume at least ASCII (eg uc($s)) -- possibly more if you use deeper magic. The second form is, in effect, an array of wide character ordinals in the range 0..(2^32)-1 (roughly speaking). When you ask Perl to interpret these as characters, it will assume Unicode. Attached to every string is a "mode-bit", telling Perl whether it contains bytes or wide characters.

The following will illustrate some of what is going on as Perl, Perl IO and the OS conspire together:

    use strict ;
    use warnings ;

    my $UT8 = 0 ;   # Option to "use utf8" -- may be commented out, below
    my $ECO = 0 ;   # Option to set ":encoding(utf8)" on STDOUT
    my $ECI = 0 ;   # Option to set ":encoding(utf8)" on STDIN
    my $UPG = 0 ;   # Option to "utf8::upgrade($s)" in show()
    my $UC  = 0 ;   # Option to "uc($s)" in show()

    foreach (@ARGV) {
      if    ($_ eq 'eco') { $ECO = 1 ; }
      elsif ($_ eq 'eci') { $ECI = 1 ; }
      elsif ($_ eq 'upg') { $UPG = 1 ; }
      elsif ($_ eq 'uc')  { $UC  = 1 ; }
      else  { die "$_ not known" ; } ;
    } ;

    #use utf8 ; $UT8 = 1 ;

    if ($ECO) { binmode STDOUT, ":encoding(utf8)" ; } ;
    if ($ECI) { binmode STDIN , ":encoding(utf8)" ; } ;

    my $m = '' ;
    if ($UT8) { $m .= " use utf8 ;" ; } ;
    if ($ECO) { $m .= " STDOUT :encoding(utf8) ;" ; } ;
    if ($ECI) { $m .= " STDIN :encoding(utf8) ;" ; } ;
    if ($UPG) { $m .= " utf8::upgrade(\$s) ;" ; } ;
    if ($UC)  { $m .= " uc(\$s) ;" ; } ;
    if ($m)   { print "Options:$m\n" ; } ;

    show(1, "Hello World") ;
    show(2, "Hello W\xF6rld") ;
    show(3, "Hello W\x{14D}rld") ;
    show(4, "Hello Wörld") ;        # ord('ö') is 0xF6
    show(5, "Hello Wōrld") ;        # ord('ō') is 0x14D

    my $n = 'a' ;
    while (my $s = <STDIN>) {
      chomp($s) ;
      show($n++, $s) ;
    } ;

    sub show {
      my ($n, $s) = @_ ;
      if ($UPG) { utf8::upgrade($s) ; } ;
      if ($UC)  { $s = uc($s) ; } ;
      print " $n: '$s' is ", utf8::is_utf8($s) ? "'wide'" : "'byte'",
            " len=", length($s), " \"", peek($s). "\"\n" ;
    } ;

    sub peek {      # Peek at the byte contents of the given string
      my ($s) = @_ ;
      use bytes ;   # Forces the unpack to show the byte contents of $s
      return join ('', map { ($_ >= 0x20) && ($_ < 0x7F) ? chr($_)
                                                        : sprintf('\\x%02X', $_) }
                         unpack('C*', $s)) ;
    } ;

and the file read via STDIN is

    Hello World
    Hello Wörld
    Hello Wōrld

With all the options off, on my machine the code above gave:

     1: 'Hello World' is 'byte' len=11 "Hello World"
     2: 'Hello W▒rld' is 'byte' len=11 "Hello W\xF6rld"
    Wide character in print at x.pl line 47.
     3: 'Hello Wōrld' is 'wide' len=11 "Hello W\xC5\x8Drld"
     4: 'Hello Wörld' is 'byte' len=12 "Hello W\xC3\xB6rld"
     5: 'Hello Wōrld' is 'byte' len=12 "Hello W\xC5\x8Drld"
     a: 'Hello World' is 'byte' len=11 "Hello World"
     b: 'Hello Wörld' is 'byte' len=12 "Hello W\xC3\xB6rld"
     c: 'Hello Wōrld' is 'byte' len=12 "Hello W\xC5\x8Drld"

YMMV. This shows a number of things:
Before we start to worry about encoding and decoding and other magic, let's see how a character function works with what we have so far. Turning on the "uc($s)" option gives:

    Options: uc($s) ;
     1: 'HELLO WORLD' is 'byte' len=11 "HELLO WORLD"
     2: 'HELLO W▒RLD' is 'byte' len=11 "HELLO W\xF6RLD"
    Wide character in print at x.pl line 47.
     3: 'HELLO WŌRLD' is 'wide' len=11 "HELLO W\xC5\x8CRLD"
     4: 'HELLO WöRLD' is 'byte' len=12 "HELLO W\xC3\xB6RLD"
     5: 'HELLO WōRLD' is 'byte' len=12 "HELLO W\xC5\x8DRLD"
     a: 'HELLO WORLD' is 'byte' len=11 "HELLO WORLD"
     b: 'HELLO WöRLD' is 'byte' len=12 "HELLO W\xC3\xB6RLD"
     c: 'HELLO WōRLD' is 'byte' len=12 "HELLO W\xC5\x8DRLD"

showing that uc() is only interested in ASCII (on my machine, anyway) in the byte mode strings, but has done a wonderful Unicode job on the wide string.

Now, how do we convert these strings from byte to wide mode, so that the contents will be treated as Unicode characters? One way is to use utf8::upgrade($s), which gives:

    Options: utf8::upgrade($s) ; uc($s) ;
     1: 'HELLO WORLD' is 'wide' len=11 "HELLO WORLD"
     2: 'HELLO W▒RLD' is 'wide' len=11 "HELLO W\xC3\x96RLD"
    Wide character in print at x.pl line 47.
     3: 'HELLO WŌRLD' is 'wide' len=11 "HELLO W\xC5\x8CRLD"
     4: 'HELLO WöRLD' is 'wide' len=12 "HELLO W\xC3\x83\xC2\xB6RLD"
     5: 'HELLO WōRLD' is 'wide' len=12 "HELLO W\xC3\x85\xC2\x8DRLD"
     a: 'HELLO WORLD' is 'wide' len=11 "HELLO WORLD"
     b: 'HELLO WöRLD' is 'wide' len=12 "HELLO W\xC3\x83\xC2\xB6RLD"
     c: 'HELLO WōRLD' is 'wide' len=12 "HELLO W\xC3\x85\xC2\x8DRLD"

which is really weird and unpleasant, but it does illustrate one piece of magic, concerning characters "\x80".."\xFF". In a byte mode string these characters appear exactly so. In a wide mode string these characters appear in their UTF-8 encoding, "\xC2\x80".."\xC3\xBF". Both forms, however, map to the same range of character ordinals, \x80..\xFF. So, Perl maps between the character encodings. Hence:
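The mapping for "\x80".."\xFF" can be checked in isolation. What follows is a minimal sketch, not part of the original script: internal_hex() is a hypothetical helper that exposes the stored octets using only utf8::is_utf8() and utf8::encode() on a copy, rather than the "use bytes" trick in peek().

```perl
use strict;
use warnings;

# Hypothetical helper: hex dump of the octets Perl stores for $s.
# For a wide (UTF8-flagged) string these are the UTF-8 encoding of its
# characters; for a byte string they are the characters themselves.
sub internal_hex {
    my ($s) = @_;                            # work on a copy
    utf8::encode($s) if utf8::is_utf8($s);   # expose the UTF-8 octets
    return join ' ', map { sprintf '%02X', ord($_) } split //, $s;
}

my $byte = "W\xF6rld";     # byte form: character 0xF6 held as one octet
my $wide = $byte;
utf8::upgrade($wide);      # wide form: same characters, different storage

printf "byte: flag=%d len=%d [%s]\n",
    (utf8::is_utf8($byte) ? 1 : 0), length($byte), internal_hex($byte);
printf "wide: flag=%d len=%d [%s]\n",
    (utf8::is_utf8($wide) ? 1 : 0), length($wide), internal_hex($wide);
print "equal as strings: ", ($byte eq $wide ? "yes" : "no"), "\n";
```

Both strings report len=5 and compare equal; only the stored octets differ (F6 in byte form, C3 B6 in wide form).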
So, now we look at what use utf8 does. If we turn that on, and turn off the other options, the code gives:

    Options: use utf8 ;
     1: 'Hello World' is 'byte' len=11 "Hello World"
     2: 'Hello W▒rld' is 'byte' len=11 "Hello W\xF6rld"
    Wide character in print at x.pl line 47.
     3: 'Hello Wōrld' is 'wide' len=11 "Hello W\xC5\x8Drld"
     4: 'Hello W▒rld' is 'wide' len=11 "Hello W\xC3\xB6rld"
    Wide character in print at x.pl line 47.
     5: 'Hello Wōrld' is 'wide' len=11 "Hello W\xC5\x8Drld"
     a: 'Hello World' is 'byte' len=11 "Hello World"
     b: 'Hello Wörld' is 'byte' len=12 "Hello W\xC3\xB6rld"
     c: 'Hello Wōrld' is 'byte' len=12 "Hello W\xC5\x8Drld"

we can see an improvement. What use utf8 does is to tell Perl to expect the source to be in UTF-8 form, and in particular to interpret UTF-8 sequences in literal strings. As shown above, strings (4) and (5) are now wide mode. So far, so good.

To sort out the printing we have to tell Perl to encode stuff as UTF-8, and we can do that on a per filehandle basis. Turning on the "STDOUT :encoding(utf8)" option, the code gives:

    Options: use utf8 ; STDOUT :encoding(utf8) ;
     1: 'Hello World' is 'byte' len=11 "Hello World"
     2: 'Hello Wörld' is 'byte' len=11 "Hello W\xF6rld"
     3: 'Hello Wōrld' is 'wide' len=11 "Hello W\xC5\x8Drld"
     4: 'Hello Wörld' is 'wide' len=11 "Hello W\xC3\xB6rld"
     5: 'Hello Wōrld' is 'wide' len=11 "Hello W\xC5\x8Drld"
     a: 'Hello World' is 'byte' len=11 "Hello World"
     b: 'Hello Wörld' is 'byte' len=12 "Hello W\xC3\xB6rld"
     c: 'Hello WÅ�rld' is 'byte' len=12 "Hello W\xC5\x8Drld"

Nearly there, most characters are showing as desired -- no more "splodges" -- and if you examine the output you will see that we're getting UTF-8 everywhere. But the lines input from STDIN now look odd.

Note especially string (2). This is in byte form. When printed with ':encoding(utf8)', byte strings are implicitly "upgraded" to UTF-8 -- remembering that this is implicitly treating the byte values as being in Latin-1 character set.
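The implicit upgrade on output can be reproduced without a terminal. This is a sketch under stated assumptions: an in-memory filehandle stands in for STDOUT, and octets_written() is a hypothetical helper name, not something from the script above.

```perl
use strict;
use warnings;

# Hypothetical helper: the octets that reach the "file" when $s is
# printed through an :encoding(utf8) layer. An in-memory filehandle
# stands in for STDOUT so the result can be inspected.
sub octets_written {
    my ($s) = @_;
    my $out = '';
    open(my $fh, '>:encoding(utf8)', \$out) or die "open failed: $!";
    print {$fh} $s;
    close($fh) or die "close failed: $!";
    return join ' ', map { sprintf '%02X', ord($_) } split //, $out;
}

# Byte string holding Latin-1 o-umlaut, like string (2).
print octets_written("W\xF6rld"), "\n";
# Wide string, like string (3).
print octets_written("W\x{14D}rld"), "\n";
# Byte string that already holds UTF-8 octets, like an undecoded input line.
print octets_written("W\xC5\x8Drld"), "\n";
```

The first call yields 57 C3 B6 72 6C 64 and the second 57 C5 8D 72 6C 64, both valid UTF-8 for the intended character. The third yields 57 C3 85 C2 8D 72 6C 64: each of the two input octets was treated as a Latin-1 character and encoded separately.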
The lines (b) & (c) are also in byte form, and those two are implicitly "upgraded" to UTF-8 as well, even though the bytes they contain are already UTF-8.

Turning on the "STDIN :encoding(utf8)" option, the code gives:

    Options: use utf8 ; STDOUT :encoding(utf8) ; STDIN :encoding(utf8) ;
     1: 'Hello World' is 'byte' len=11 "Hello World"
     2: 'Hello Wörld' is 'byte' len=11 "Hello W\xF6rld"
     3: 'Hello Wōrld' is 'wide' len=11 "Hello W\xC5\x8Drld"
     4: 'Hello Wörld' is 'wide' len=11 "Hello W\xC3\xB6rld"
     5: 'Hello Wōrld' is 'wide' len=11 "Hello W\xC5\x8Drld"
     a: 'Hello World' is 'wide' len=11 "Hello World"
     b: 'Hello Wörld' is 'wide' len=11 "Hello W\xC3\xB6rld"
     c: 'Hello Wōrld' is 'wide' len=11 "Hello W\xC5\x8Drld"

At last! If we tell Perl that literal strings contain UTF-8 (use utf8), that the input is UTF-8 encoded (:encoding(utf8)) and the output should also be UTF-8 encoded -- then, surprise (!), we appear to get what we want.

Are we there yet? Not quite. If we now try our uc($s) option, we get:

    Options: use utf8 ; STDOUT :encoding(utf8) ; STDIN :encoding(utf8) ; uc($s) ;
     1: 'HELLO WORLD' is 'byte' len=11 "HELLO WORLD"
     2: 'HELLO WöRLD' is 'byte' len=11 "HELLO W\xF6RLD"
     3: 'HELLO WŌRLD' is 'wide' len=11 "HELLO W\xC5\x8CRLD"
     4: 'HELLO WÖRLD' is 'wide' len=11 "HELLO W\xC3\x96RLD"
     5: 'HELLO WŌRLD' is 'wide' len=11 "HELLO W\xC5\x8CRLD"
     a: 'HELLO WORLD' is 'wide' len=11 "HELLO WORLD"
     b: 'HELLO WÖRLD' is 'wide' len=11 "HELLO W\xC3\x96RLD"
     c: 'HELLO WŌRLD' is 'wide' len=11 "HELLO W\xC5\x8CRLD"

...everything is fine, except for the byte form string -- which came from the literal with the "\xF6" escape.
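The difference between uc() on the byte form and on the wide form of the same characters can be seen without any I/O. This is a minimal sketch: ords() and uc_wide() are hypothetical helper names, and it assumes neither a locale nor "use feature 'unicode_strings'" (available from Perl 5.12) is in effect, since either would change how uc() treats the byte form.

```perl
use strict;
use warnings;

# Hypothetical helper: character ordinals of $s, in hex.
sub ords {
    my ($s) = @_;
    return join ' ', map { sprintf '%02X', ord($_) } split //, $s;
}

# Hypothetical helper: uc() applied to a copy first forced into wide form.
sub uc_wide {
    my ($s) = @_;
    utf8::upgrade($s);
    return uc($s);
}

my $s = "W\xF6rld";     # byte form, as produced by the "\xF6" escape

print "byte form: ", ords(uc($s)),      "\n";   # 0xF6 is left alone
print "wide form: ", ords(uc_wide($s)), "\n";   # 0xF6 becomes 0xD6
```

In byte form only the ASCII letters change (57 F6 52 4C 44); in wide form the o-umlaut is uppercased too (57 D6 52 4C 44).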
We still need the "utf8::upgrade($s)" option, and so:

    Options: use utf8 ; STDOUT :encoding(utf8) ; STDIN :encoding(utf8) ; utf8::upgrade($s) ; uc($s) ;
     1: 'HELLO WORLD' is 'wide' len=11 "HELLO WORLD"
     2: 'HELLO WÖRLD' is 'wide' len=11 "HELLO W\xC3\x96RLD"
     3: 'HELLO WŌRLD' is 'wide' len=11 "HELLO W\xC5\x8CRLD"
     4: 'HELLO WÖRLD' is 'wide' len=11 "HELLO W\xC3\x96RLD"
     5: 'HELLO WŌRLD' is 'wide' len=11 "HELLO W\xC5\x8CRLD"
     a: 'HELLO WORLD' is 'wide' len=11 "HELLO WORLD"
     b: 'HELLO WÖRLD' is 'wide' len=11 "HELLO W\xC3\x96RLD"
     c: 'HELLO WŌRLD' is 'wide' len=11 "HELLO W\xC5\x8CRLD"

and at last we've succeeded in:
Literal strings such as (2) above are a problem. If at some point they are not "upgraded", then they will not operate as intended. Some things (eg print) will implicitly "upgrade" byte strings. If a wide string and a byte string are processed together, the byte string will be implicitly upgraded. At other times a byte string will be processed as is. You may choose to always "upgrade" such strings as soon as they are assigned, or always "upgrade" everything before running some wide character operation on it. The following may, or may not, appeal:
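As an illustration of the "upgrade as soon as assigned" choice, one possible shape is a small wrapper around utf8::upgrade(). This is a sketch only: wide() is a hypothetical helper name introduced here, and it is not taken from the original post.

```perl
use strict;
use warnings;

# Hypothetical helper: return wide-form copies of the arguments, so
# that later character operations see Unicode semantics.
sub wide {
    my @copies = @_;
    utf8::upgrade($_) for @copies;
    return wantarray ? @copies : $copies[0];
}

my $s = wide("Hello W\xF6rld");      # upgraded at the point of assignment

print utf8::is_utf8($s) ? "wide" : "byte", "\n";
print uc($s) eq "HELLO W\xD6RLD" ? "uc handled o-umlaut" : "uc missed o-umlaut", "\n";
```

Wrapping the literal keeps the upgrade next to the assignment, at the cost of having to remember the wrapper for every such literal.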
This area is complicated, partly because wide character handling is inherently complicated and generally unfamiliar, and partly because Perl has to avoid breaking (too much) stuff which depends on the old, familiar byte string handling.

In the above I have tried to show how the various parts hang together and which part does what. The conclusion is that if you ensure that all sources and sinks of strings are correctly set to expect UTF-8, then things are pretty straightforward. Along the way, however, I have tried to show why all those are necessary.

For more on how to set filehandles to handle UTF-8 (and other) encodings, see open and binmode.

In reply to Re^3: function length() in UTF-8 context
by gone2015