in reply to Re^2: Interventionist Unicode Behaviors
in thread Interventionist Unicode Behaviors

    It's trying and failing to convert Unicode code point 0x263a to Latin-1. We see the warning because it's impossible to translate a code point that high to Latin-1.

No, that first snippet in the OP is not trying to convert Unicode U+263A to Latin-1; the wide-character warning is simply telling you that you have a string that perl is treating as utf8 data, and it's being printed to a file handle that has not been set up for that. You're right that it would be impossible to translate that code point to Latin-1, but since the snippet is not doing that, it's not an issue here.
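A minimal way to see the distinction (a sketch of my own, not the OP's snippet; the variable names are mine):

```perl
use strict;
use warnings;
use Encode qw(encode);

my $smiley = "\x{263A}";      # WHITE SMILING FACE: one character

# Printing $smiley to a handle with no encoding layer provokes
# "Wide character in print": perl can't represent U+263A as a
# single byte, so it emits its internal UTF-8 bytes and warns.
# Encoding explicitly at the output boundary avoids the warning:
my $octets = encode('UTF-8', $smiley);   # "\xE2\x98\xBA"

printf "%d character, %d bytes\n", length($smiley), length($octets);
# 1 character, 3 bytes
```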

As for your snippet with "$resume", that demonstrates a "feature" of Perl's internal character representation that I was unaware of until now -- thanks for pointing this out. To clarify, the actual byte sequence differs in the two print statements, as follows:

# first output line:
72 c3 a9 73 75 6d c3 a9 0a
 r c3 a9  s  u  m c3 a9 nl
# second output line:
72 e9 73 75 6d e9 0a
 r e9  s  u  m e9 nl
So, the "quasi-ambiguous" nature of bytes/characters in the \x80-\xff (U0080-U00FF) range seems deeper, subtler, more strange than I expected: for this set, perl's internal representation is still single-byte, not really utf8.
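That single-byte internal representation can be poked at directly with utf8::upgrade (a sketch; $s is my own example string, and the utf8::is_utf8 checks inspect only the internal flag, not the string's meaning):

```perl
use strict;
use warnings;

my $s = "r\xe9sum\xe9";         # é as latin-1: one byte each, internally

die unless length($s) == 6;     # six characters
die unless !utf8::is_utf8($s);  # stored single-byte, utf8 flag off

utf8::upgrade($s);              # re-store internally as UTF-8

die unless length($s) == 6;     # still six characters: the string's
die unless utf8::is_utf8($s);   # meaning is unchanged; only the
                                # internal encoding differs
```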

(Update: Well, on second thought, maybe I'm still not so clear on this myself; the fact that "\xc3\xa9" turns into "\xe9" -- the single-byte Latin-1 é -- because you hit it with the perl-internal "_utf8_on" function and print it to a non-utf8 file handle... that is some heavy-weight voodoo. I'm with Juerd: don't play with _utf8_on -- you should have seen and heeded the warning about that in the Encode docs.)

Actually, I had already been aware of that (in some sense), but I had not seen its effect on file output. If you put binmode STDOUT, ":utf8" between the two print statements (to coincide with upgrading the string to utf8), the byte sequences of the two outputs would be identical.
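That claim can be checked with an in-memory filehandle (a sketch of my own; I use Encode::decode rather than _utf8_on, since it yields the same utf8-flagged string without poking at perl's internals):

```perl
use strict;
use warnings;
use Encode qw(decode);

my $bytes = "r\xc3\xa9sum\xc3\xa9\n";    # UTF-8 octets, utf8 flag off

open my $fh, '>', \my $buf or die $!;
print {$fh} $bytes;                      # written out verbatim

binmode $fh, ':utf8';                    # switch layers...
my $chars = decode('UTF-8', $bytes);     # ...as the string upgrades
print {$fh} $chars;                      # re-encoded on the way out
close $fh;

die unless $buf eq $bytes x 2;           # the two lines are identical
```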

Now I understand much better what the rationale is behind the "wide character" warnings -- the behavior demonstrated here is a case of character data that ought to be interpreted as utf8 on output (because it has the utf8 flag turned on), but is not being so interpreted (so it comes out as non-utf8 data, i.e. ill-formed/undisplayable).

So the basis of this trouble is not specifically PerlIO layers, but rather Perl's current internal representation of this byte/character range, and how that interacts with "default" vs. "utf8" file handles.

It's a difficult, tricky situation... As you demonstrate, leaving STDOUT in its default state throughout causes one kind of problem. But if it were set to ":utf8" before the first print statement, the two outputs would again be different, but in a different way, and the first one would be "wrong":

# first line (after binmode STDOUT, ":utf8"):
72 c3 83 c2 a9 73 75 6d c3 83 c2 a9 0a
 r c3 83 c2 a9  s  u  m c3 83 c2 a9 nl
# second line (still ":utf8"):
72 c3 a9 73 75 6d c3 a9 0a
 r c3 a9  s  u  m c3 a9 nl
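That first, "wrong" line is easy to reproduce (a sketch; the in-memory filehandle just stands in for STDOUT):

```perl
use strict;
use warnings;

open my $fh, '>', \my $buf or die $!;
binmode $fh, ':utf8';

my $s = "r\xc3\xa9sum\xc3\xa9\n";   # 9 characters, utf8 flag off
print {$fh} $s;                     # each C3 and A9 character is
close $fh;                          # re-encoded: C3 -> C3 83, A9 -> C2 A9

die unless $buf =~ /\xc3\x83\xc2\xa9/;   # the doubled bytes are there
```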
    I'd like to spit out scalars flagged as UTF8 by default from KinoSearch...

I've got a reply to that elsewhere in this thread. (I'm full of replies tonight, it seems.)

Re^4: Interventionist Unicode Behaviors
by Juerd (Abbot) on Sep 08, 2006 at 10:53 UTC

    So, the "quasi-ambiguous" nature of bytes/characters in the \x80-\xff (U0080-U00FF) range seems deeper, subtler, more strange than I expected: for this set, perl's internal representation is still single-byte, not really utf8.

    It may be either single byte or UTF8, depending on your environment (pragmas). This is NO PROBLEM if you properly decode all your input, and encode all your output. This is not a bug, but a feature that is much needed for backwards compatibility with old code.
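    The decode-all-input, encode-all-output discipline looks like this in practice (a sketch; "café" is my own example data):

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Decode at the input boundary, work with characters in the
# middle, encode at the output boundary.
my $octets_in = "caf\xc3\xa9";               # 5 bytes from outside
my $text      = decode('UTF-8', $octets_in); # "café": 4 characters

die unless length($text) == 4;
die unless ord(substr $text, -1) == 0xE9;    # last char is é, U+00E9

my $octets_out = encode('UTF-8', $text);     # back to 5 bytes
die unless $octets_out eq $octets_in;        # round-trips exactly
```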

    But if it were set to ":utf8" before the first print statement, the two outputs would again be different, but in a different way, and the first one would be "wrong":

    Before the "_utf8_on", which I stress is a BAD IDEA, the string is latin-1. It's converted to UTF-8 as the binmode requested: C3 becomes C3 83 and A9 becomes C2 A9, etcetera. With the "_utf8_on" you tell Perl that, no, it's not latin-1, but UTF-8. And since that matches the output encoding, Perl no longer has any need to convert anything.

    In other words, first the string is "rÃ©sumÃ©\n" (each é is really the two latin-1 characters C3 and A9), which when printed is encoded into UTF-8 as 72 C3 83 C2 A9 73 75 6d C3 83 C2 A9 0A; then someone messes with the internals and all of a sudden the string is "résumé\n", already UTF-8 encoded as 72 C3 A9 73 75 6d C3 A9 0A. (Two digits per byte.)

    Juerd # { site => 'juerd.nl', do_not_use => 'spamtrap', perl6_server => 'feather' }

      This is NO PROBLEM if you properly decode all your input, and encode all your output.
      And here is where the Unicode-istas go wrong. Every single piece of software on this 'ere machine - and indeed all the machines I use regularly - was packaged well after Unicode became fashionable. In fact, a great deal of it has either been written from scratch or at least received patches, often large ones, since Unicode became fashionable. And yet Unicode doesn't "Just Work". It should, and requiring me to dick about just so I can see non-ASCII characters reliably is a bug.

        The only way of having Unicode/UTF-8 work automatically, by default, without being explicit about it, is assuming that every string is UTF-8 encoded. Such a naive view of the world would have broken most of the gazillion Perl programs and modules that already existed, and would make it hard to ever pick a new default: the iso-8859-1 problem all over again.

        I, for one, am very happy that Perl chose to implement Unicode, not UTF-8, and to implement character sets, not UTF-8. As a result, we do get UTF-8 in a very simple and straightforward way, without breaking backwards and future compatibility.

        Through its character encoding framework, Perl has reached a much higher level of Unicode support than any other dynamic language has so far. All this, without introducing types or assuming anything.

        Joel Spolsky is absolutely right when he writes, "It does not make sense to have a [text] string without knowing what encoding it uses." And so, we shouldn't assume any particular character set. Well, we must assume iso-8859-1 by default, because in practice Perl (and many CPAN modules) has always done so, and we want to maintain compatibility. And because the codepoints of the incompatible bytes are so nicely equivalent that we can safely upgrade these strings.

        Character encodings can never "Just Work". That's not because of Perl, but because of the rest of the world. More specifically, because a lot of (incompatible) character encodings exist. That's tough, and we have to live with it. Fortunately, Perl makes that easy.

        Juerd # { site => 'juerd.nl', do_not_use => 'spamtrap', perl6_server => 'feather' }

Re^4: Interventionist Unicode Behaviors
by ysth (Canon) on Sep 10, 2006 at 07:15 UTC
    No, that first snippet in the OP is not trying to convert Unicode U+263A to Latin-1; the wide-character warning is simply telling you that you have a string that perl is treating as utf8 data, and it's being printed to a file handle that has not been set up for that.
    This isn't quite true. AFAICT the wide-character warning is only given when printing a string that can't be converted to Latin-1 on a filehandle that doesn't have an encoding specified. If it can be converted to Latin-1, it is, and there is no warning. If not, the utf8 encoding is output. (Filehandles with an encoding specified will give a warning like "\x{0391}" does not map to iso-8859-15 and output (in this case) the literal 8 characters "\x{0391}".)
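    That rule is easy to verify (a sketch; the two strings are my own examples, one that fits in Latin-1 and one that does not):

```perl
use strict;
use warnings;
use Encode qw(decode);

my $eacute = decode('UTF-8', "\xc3\xa9");  # é: utf8-flagged, <= 0xFF
my $alpha  = "\x{0391}";                   # GREEK CAPITAL LETTER ALPHA

die unless length($eacute) == 1 && ord($eacute) == 0xE9;
die unless ord($alpha) == 0x0391;

# On a handle with no encoding layer:
#   print $eacute;  # silently downgraded to the byte E9, no warning
#   print $alpha;   # "Wide character in print"; UTF-8 bytes emitted
```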