in reply to Re^2: Interventionist Unicode Behaviors
in thread Interventionist Unicode Behaviors
It's trying and failing to convert Unicode code point 0x263a to Latin-1. We see the warning because it's impossible to translate a code point that high to Latin 1.
No, that first snippet in the OP is not trying to convert U+263A to Latin-1; the "wide character" warning is simply telling you that you have a string that perl is treating as utf8 data (it contains a character above 0xFF), and it's being printed to a file handle that has not been set up for that. You're right that it would be impossible to translate that code point to Latin-1, but since the snippet is not doing that, it's not an issue here.
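For anyone who wants to check this without the OP's snippet in front of them, here's a minimal sketch. It assumes an in-memory file handle opened with ":raw" behaves like a default STDOUT on a Unix-ish system; the helper name print_and_capture is just something I made up for illustration.

```perl
use strict;
use warnings;

# Print $string to an in-memory handle opened with $layer and return
# a hex dump of the bytes written, followed by any warnings raised.
sub print_and_capture {
    my ($string, $layer) = @_;
    my @warnings;
    local $SIG{__WARN__} = sub { push @warnings, @_ };
    my $buf = '';
    open my $fh, ">$layer", \$buf or die "open failed: $!";
    print {$fh} $string;
    close $fh;
    my $hex = join ' ', map { sprintf '%02x', ord } split //, $buf;
    return ($hex, @warnings);
}

# Same string (U+263A), two differently configured handles.
my ($raw_hex, @raw_warn) = print_and_capture("\x{263a}", ':raw');
my ($utf_hex, @utf_warn) = print_and_capture("\x{263a}", ':utf8');

print "raw:  $raw_hex (", scalar @raw_warn, " warning(s))\n";
print "utf8: $utf_hex (", scalar @utf_warn, " warning(s))\n";
```

Both handles should receive the same three utf8 bytes; only the handle without the ":utf8" layer should trigger the warning.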
As for your snippet with "$resume", that demonstrates a "feature" of Perl's internal character representation that I was unaware of until now -- thanks for pointing this out. To clarify, the actual byte sequence differs in the two print statements, as follows:
# first output line:
72 c3 a9 73 75 6d c3 a9 0a
r  c3 a9 s  u  m  c3 a9 nl
# second output line:
72 e9 73 75 6d e9 0a
r  e9 s  u  m  e9 nl
So, the "quasi-ambiguous" nature of bytes/characters in the \x80-\xff (U+0080-U+00FF) range seems deeper, subtler, and stranger than I expected: for this set, perl's internal representation is still single-byte, not really utf8.
(Update: Well, on second thought, maybe I'm still not so clear on this myself; the fact that "\xc3\xa9" turns into "\xe9" -- the single-byte Latin-1 é -- because you hit it with the perl-internal "_utf8_on" function and print it to a non-utf8 file handle... that is some heavy-weight voodoo. I'm with Juerd: don't play with _utf8_on -- you should have seen and heeded the warning about that in the Encode docs.)
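Here's a rough reconstruction of what I take the "$resume" snippet to be doing. The string literal and the helper bytes_written are my assumptions (the original code isn't quoted in this reply), and Encode::_utf8_on appears only to reproduce the effect, not as a recommendation.

```perl
use strict;
use warnings;
use Encode ();

# Return a hex dump of the bytes that reach a handle opened with $layer.
sub bytes_written {
    my ($string, $layer) = @_;
    my $buf = '';
    open my $fh, ">$layer", \$buf or die "open failed: $!";
    print {$fh} $string;
    close $fh;
    return join ' ', map { sprintf '%02x', ord } split //, $buf;
}

# Assumed input: the UTF-8 encoding of "résumé\n" as a plain byte string.
my $resume = "r\xc3\xa9sum\xc3\xa9\n";
my $before = bytes_written($resume, ':raw');

# Internal function: flips the flag without touching the stored bytes.
Encode::_utf8_on($resume);
my $after = bytes_written($resume, ':raw');

print "flag off: $before\n";
print "flag on:  $after\n";
```

With the flag off, the nine bytes go out untouched. With the flag on, perl sees seven characters, all below 0x100, and writes each as a single byte to the non-utf8 handle.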
Actually, I had already been aware of that (in some sense), but I had not seen its effect on file output. If you put binmode STDOUT, ":utf8" between the two print statements (to coincide with upgrading the string to utf8), the byte sequences of the two outputs would be identical.
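A sketch of that variation, again using an in-memory handle as a stand-in for STDOUT and an assumed "$resume" byte string:

```perl
use strict;
use warnings;
use Encode ();

my $resume = "r\xc3\xa9sum\xc3\xa9\n";   # assumed input, utf8 flag off

my $buf = '';
open my $fh, '>:raw', \$buf or die "open failed: $!";

print {$fh} $resume;           # first print: bytes pass through as-is

binmode $fh, ':utf8';          # switch the handle ...
Encode::_utf8_on($resume);     # ... at the same moment the string is flagged
print {$fh} $resume;           # second print: characters encoded as utf8

close $fh;

# Split the captured output into its two lines and dump each as hex.
my @lines = map { join ' ', map { sprintf '%02x', ord } split //, $_ }
            split /(?<=\n)/, $buf;

print "$_\n" for @lines;
```

Both dumps should come out the same, since the handle's layer now matches the state of the string at each print.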
Now I understand much better what the rationale is behind the "wide character" warnings -- the behavior demonstrated here is a case of character data that ought to be interpreted as utf8 on output (because it has the utf8 flag turned on), but is not being so interpreted (so it comes out as non-utf8 data, i.e. ill-formed/undisplayable).
So the basis of this trouble is not specifically PerlIO layers, but rather Perl's current internal representation of this byte/character range, and how that interacts with "default" vs. "utf8" file handles.
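To make the internal-representation point a bit more concrete, here's a small sketch using the documented utf8::upgrade and utf8::is_utf8 functions instead of _utf8_on. The variable names and the helper bytes_written are mine.

```perl
use strict;
use warnings;
use bytes ();   # load bytes::length without enabling byte semantics

# Return a hex dump of the bytes that reach a handle opened with $layer.
sub bytes_written {
    my ($string, $layer) = @_;
    my $buf = '';
    open my $fh, ">$layer", \$buf or die "open failed: $!";
    print {$fh} $string;
    close $fh;
    return join ' ', map { sprintf '%02x', ord } split //, $buf;
}

my $native   = "r\xe9sum\xe9";   # six characters, stored one byte each
my $upgraded = $native;
utf8::upgrade($upgraded);        # same characters, utf8 storage internally

printf "equal as strings: %s\n", ($native eq $upgraded ? 'yes' : 'no');
printf "flag: native=%d upgraded=%d\n",
    (utf8::is_utf8($native) ? 1 : 0), (utf8::is_utf8($upgraded) ? 1 : 0);
printf "chars=%d internal bytes=%d\n",
    length($upgraded), bytes::length($upgraded);

# The handle's layer, not the flag, decides what gets written.
my $raw_native   = bytes_written($native,   ':raw');
my $raw_upgraded = bytes_written($upgraded, ':raw');
my $utf_native   = bytes_written($native,   ':utf8');
my $utf_upgraded = bytes_written($upgraded, ':utf8');
print "raw:  $raw_native | $raw_upgraded\n";
print "utf8: $utf_native | $utf_upgraded\n";
```

The two strings compare equal and produce the same output on a given handle; what changes between handles is whether é goes out as one byte or two.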
It's a difficult, tricky situation... As you demonstrate, leaving STDOUT in its default state throughout causes one kind of problem. But if it were set to ":utf8" before the first print statement, the two outputs would again be different, but in a different way, and the first one would be "wrong":
# first line (after binmode STDOUT, ":utf8")
72 c3 83 c2 a9 73 75 6d c3 83 c2 a9 0a
r  c3 83 c2 a9 s  u  m  c3 83 c2 a9 nl
# second line (still ":utf8")
72 c3 a9 73 75 6d c3 a9 0a
r  c3 a9 s  u  m  c3 a9 nl
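The same case as a sketch, with the ":utf8" layer in place from the start. The input string and the helper bytes_written are assumptions on my part. I've also added Encode::decode, which was not in the snippet under discussion, as the documented way to get from UTF-8 bytes to characters without touching _utf8_on.

```perl
use strict;
use warnings;
use Encode ();

# Return a hex dump of the bytes that reach a handle opened with $layer.
sub bytes_written {
    my ($string, $layer) = @_;
    my $buf = '';
    open my $fh, ">$layer", \$buf or die "open failed: $!";
    print {$fh} $string;
    close $fh;
    return join ' ', map { sprintf '%02x', ord } split //, $buf;
}

my $bytes = "r\xc3\xa9sum\xc3\xa9\n";   # assumed input, utf8 flag off

# Flag off: each byte is taken as a Latin-1 character and re-encoded.
my $first = bytes_written($bytes, ':utf8');

# Flag forced on: the bytes are read as two-byte utf8 characters.
my $flagged = $bytes;
Encode::_utf8_on($flagged);
my $second = bytes_written($flagged, ':utf8');

# Documented alternative: decode the bytes into characters explicitly.
my $decoded = Encode::decode('UTF-8', $bytes);
my $third   = bytes_written($decoded, ':utf8');

print "flag off: $first\n";
print "flag on:  $second\n";
print "decoded:  $third\n";
```

The first dump shows the double encoding (c3 83 c2 a9 for each é); the other two should match each other.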
I'd like to spit out scalars flagged as UTF8 by default from KinoSearch...
I've got a reply to that elsewhere in this thread. (I'm full of replies tonight, it seems.)