in reply to Interventionist Unicode Behaviors

For instance, STDOUT takes it upon itself to attempt translation of data rather than just printing what I want...

(code snippet)

Is there any reason why STDOUT shouldn't just print whatever gets thrown at it? How about, if I want it to perform automatic translation, I turn that feature on?

Huh? In your snippet, perl STDOUT is "just printing whatever gets thrown at it", without doing any sort of "translation" on it.

The first output is a three-byte sequence that, when viewed on a utf8-aware display, will show a single unicode character. If you want to see that as three separate bytes, print to something that does not do utf8 interpretation -- e.g. (in unix):

perl -e 'print "\xE2\x98\xBA"' | od -txC

In the second output, you've told perl that your three-byte string should be interpreted by perl internally as utf8 data, and then you print it to a file handle that has not been configured for that encoding, so you get the warning, but that's just a warning, and the output is effectively the same as it was before -- and how you see it will depend on what you use to view it.
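As a rough sketch of that point, assuming an in-memory filehandle with no encoding layer is a fair stand-in for STDOUT in its default state: the helper name capture_print is made up for illustration, and _utf8_on is used only to mirror the original snippet. It records the bytes that print writes, and how many warnings it raises, before and after the flag is switched on.

```perl
use strict;
use warnings;
use Encode qw( _utf8_on );

# Hypothetical helper: print $string to an in-memory handle that has no
# encoding layer, and return (hex dump of the bytes written, warning count).
sub capture_print {
    my ($string) = @_;
    my $buffer = '';
    my @warnings;
    local $SIG{__WARN__} = sub { push @warnings, @_ };
    open my $fh, '>', \$buffer or die "cannot open in-memory handle: $!";
    print {$fh} $string;
    close $fh or die "cannot close in-memory handle: $!";
    my $hex = join ' ', map { sprintf '%02x', $_ } unpack 'C*', $buffer;
    return ( $hex, scalar @warnings );
}

my $smiley = "\xE2\x98\xBA";    # three plain bytes
my ( $plain_hex, $plain_warnings ) = capture_print($smiley);

_utf8_on($smiley);              # same bytes, now flagged as utf8 (internals only)
my ( $flagged_hex, $flagged_warnings ) = capture_print($smiley);

print "unflagged: $plain_hex, warnings: $plain_warnings\n";
print "flagged:   $flagged_hex, warnings: $flagged_warnings\n";
```

The two hex dumps should match; only the warning count differs.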

(In perl 5.8.0, esp. with Red Hat, perl actually referred to the user's "locale" settings in order to "automagically" do utf8 conversion on output whenever the locale cited utf8; everyone quickly agreed that this was a Big Mistake(TM), and the behavior was corrected in 5.8.1, never to return.)

Another "feature" that's bitten me in the butt is the silent "upgrading" (i.e. corruption) of non-UTF8 scalars when concatenated with UTF8 scalars...

Conceptually, appending a non-UTF8 string to a UTF8 string is a really bad idea, bordering on stupid. Don't do that. (Why would you want to? What would you hope to accomplish as a result?)

Your second snippet shows the "special" (quasi-ambiguous) status of byte values in the \x80-\xFF range in perl 5.8: (<update>:) when used in a "raw" (non-utf8) context, they are treated simply as single byte values without further ado -- e.g.  print "\xA0" prints just one byte when STDOUT is in ":raw" (default) mode -- but (</update>) when used in a utf8 context (e.g. appended to a utf8 string or printed to a file handle that is set to utf8 mode), they are automatically "upgraded" to utf8 characters by changing the single byte to its two-byte utf8 equivalent. For people migrating out of iso-8859-1 into unicode (which is quite a few people, even now), this prevents a lot more trouble than it creates. Admittedly, a lot of people who don't yet understand unicode and/or utf8 can and do get into trouble with this.
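A small sketch of the upgrade described above; the variable names are illustrative. A single \xE9 byte is appended to a string holding a wide character, and the result is then encoded explicitly as UTF-8 so the bytes can be inspected.

```perl
use strict;
use warnings;
use Encode qw( encode );

my $non_utf8 = "\xE9";        # one byte, not flagged as utf8
my $utf8     = "\x{263A}";    # one wide character, flagged as utf8

my $joined = $utf8 . $non_utf8;

# The \xE9 byte is now treated as the character U+00E9 ...
my $chars   = length $joined;
my $flagged = utf8::is_utf8($joined) ? 'yes' : 'no';

# ... so encoding the result as UTF-8 gives two bytes for it.
my $hex = join ' ', map { sprintf '%02x', $_ } unpack 'C*', encode( 'UTF-8', $joined );

print "characters: $chars\n";
print "flagged:    $flagged\n";
print "as UTF-8:   $hex\n";
```

For someone whose \xE9 really was Latin-1 text this is the helpful outcome; for someone whose \xE9 was part of an opaque byte string it is the "corruption" complained about above.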

As for your "preferred API", I don't think I understand what you are trying to demonstrate with the first two "print" statements. As for the third print statement ("$utf8 . $non_utf8"), if the latter scalar contains data that cannot be parsed as utf8, any utf8-aware display will simply put question-marks for the bytes that make no sense. That's what the Unicode Standard says is the appropriate thing to do; Perl will only tell you your non-utf8 data cannot be used directly as utf8 if/when you try to do:

decode( 'utf8', $non_utf8, Encode::FB_CROAK ); # or Encode::FB_WARN
or you can do the "default" decoding, without the third "check" parameter, and the resulting string will contain one or more \x{FFFD} unicode characters (rendered in three utf8 bytes, of course), which refers to a code point labeled "REPLACEMENT CHARACTER", which will either be ignored or show up as a question-mark, depending on what utf8-aware tool you use to view it.
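To make the two decoding modes concrete, here is a hedged sketch; the sample string is invented, and a copy is passed to the strict call because decode may modify its input when a CHECK argument is given.

```perl
use strict;
use warnings;
use Encode qw( decode );

my $non_utf8 = "caf\xE9 au lait";    # Latin-1 bytes, not well-formed utf8

# Strict: with FB_CROAK, decode dies on the malformed byte.
my $copy   = $non_utf8;
my $strict = eval { decode( 'utf8', $copy, Encode::FB_CROAK() ) };
my $error  = $@;
print "strict decode failed: $error" unless defined $strict;

# Default: without a CHECK argument, malformed input becomes U+FFFD.
my $lenient = decode( 'utf8', $non_utf8 );
my $count   = () = $lenient =~ /\x{FFFD}/g;
print "lenient decode produced $count replacement character(s)\n";
```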

If you have non-utf8 data and you want to "display" it using a utf8-aware terminal/window, you need to figure out how to make it intelligible, both to the displayer and to the user.

To get rid of the "wide character in print" warnings, do  binmode FILEHANDLE, ":utf8" or use the three-arg version of the "open" statement when opening an output file:  open FH, ">:utf8", $filename -- check the man page for "open" (perldoc -f open).
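Both variants can be tried without touching the terminal. This sketch assumes in-memory filehandles behave like ordinary files for this purpose; it collects warnings in an array so it is easy to see that none are raised once the handle has a ":utf8" layer.

```perl
use strict;
use warnings;

# Collect any warnings instead of letting them reach STDERR.
my @warnings;
local $SIG{__WARN__} = sub { push @warnings, @_ };

# Variant 1: attach the layer when opening the handle.
my $opened = '';
open my $fh1, '>:utf8', \$opened or die "cannot open: $!";
print {$fh1} "\x{263A}\n";
close $fh1 or die "cannot close: $!";

# Variant 2: attach the layer afterwards with binmode.
my $binmoded = '';
open my $fh2, '>', \$binmoded or die "cannot open: $!";
binmode( $fh2, ':utf8' ) or die "cannot binmode: $!";
print {$fh2} "\x{263A}\n";
close $fh2 or die "cannot close: $!";

my $hex1 = join ' ', map { sprintf '%02x', $_ } unpack 'C*', $opened;
my $hex2 = join ' ', map { sprintf '%02x', $_ } unpack 'C*', $binmoded;
print "open with layer: $hex1\n";
print "binmode:         $hex2\n";
print "warnings:        ", scalar @warnings, "\n";
```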

Re^2: Interventionist Unicode Behaviors
by creamygoodness (Curate) on Sep 08, 2006 at 08:54 UTC
    Huh? In your snippet, perl is "just printing whatever gets thrown at it", without doing any sort of "translation" on it.

    It's trying and failing to convert Unicode code point 0x263a to Latin-1. We see the warning because it's impossible to translate a code point that high to Latin 1.

    I thought the example I gave was the easiest to grok, but this is probably better, because the output is actually different.

    #!/usr/bin/perl
    use strict;
    use warnings;
    use Encode qw( _utf8_on );

    my $resume = "r\xc3\xa9sum\xc3\xa9";
    print $resume, "\n";
    _utf8_on($resume);
    print $resume, "\n";

    Conceptually, appending a non-UTF8 string to a UTF8 string is a really bad idea, bordering on stupid. Don't do that. (Why would you want to? What would you hope to accomplish as a result?)

    I'd like to spit out scalars flagged as UTF8 by default from KinoSearch. But if I do that, that means anybody who gets that output is going to have to know how to deal with them. I don't want to spend all my time explaining the bottomless intricacies of Unicode handling in Perl to people. It's not that I want to be doing a lot of this concatenation, it's that I know it's going to happen some of the time and I don't want the support burden.

    --
    Marvin Humphrey
    Rectangular Research ― http://www.rectangular.com

      It's trying and failing to convert Unicode code point 0x263a to Latin-1.

      No, it is not.

      You asked for the code points E2, 98 and BA, and you got them. You then manually messed around with the UTF8 flag. Because of your environment, Perl encoded the three-character string as latin-1, so the bytes were E2 98 BA, and so you are lucky. Then you set the UTF8 flag on, and finally you have that code point 263a, but you didn't get it the way you should have. When you print this string, however, there's no conversion going on AT ALL, because you never specified what to convert TO!

      Perl has no choice but to dump its internal representation to STDOUT, but is friendly enough to warn you that this output may not be what you want, because it doesn't know what you want.

      We see the warning because it's impossible to translate a code point that high to Latin 1.

      No, we see the warning because you're printing something that has the UTF8 flag set (and thus with certainty is a text string), to a filehandle that doesn't have an encoding attached to it.

      I don't want to spend all my time explaining the bottomless intricacies of Unicode handling in Perl to people.

      Neither do we, but apparently you INSIST that you use the internals directly instead of the way things were intended, so we have to explain to you these bottomless intricacies of Unicode handling in Perl's internals if you're ever to understand what the heck your broken code really does.

      Juerd # { site => 'juerd.nl', do_not_use => 'spamtrap', perl6_server => 'feather' }

        You asked for the code points E2, 98 and BA, and you got them. You then manually messed around with the UTF8 flag. Because of your environment, Perl encoded the three-character string as latin-1, so the bytes were E2 98 BA, and so you are lucky.

        Please interpret that sequence of commands as a round-about way of specifying that I wanted a particular sequence of bytes in the scalar's PV. I could also have written it like this:

        use Inline C => <<'END_C';
        SV *
        get_smiley() {
            SV *const smiley_sv = newSV(3);
            SvPOK_on(smiley_sv);
            SvUTF8_on(smiley_sv);
            unsigned char *ptr = (unsigned char*)SvPVX(smiley_sv);
            *ptr++ = 0xE2;
            *ptr++ = 0x98;
            *ptr++ = 0xBA;
            *ptr   = 0x00;
            SvCUR_set(smiley_sv, 3);
            return smiley_sv;
        }
        END_C

        my $smiley = get_smiley();

        No, we see the warning because you're printing something that has the UTF8 flag set (and thus with certainty is a text string), to a filehandle that doesn't have an encoding attached to it.

        Please refer to the message with the "résumé" sample. In that sample, a scalar with the UTF8 flag set is printed to a filehandle that has not had an encoding explicitly attached to it. No warning occurs.

        we have to explain to you these bottomless intricacies of Unicode handling in Perl's internals if you're ever to understand what the heck your broken code really does.

        Are you implying that I broke that code accidentally? ;)

        --
        Marvin Humphrey
        Rectangular Research ― http://www.rectangular.com
      It's trying and failing to convert Unicode code point 0x263a to Latin-1. We see the warning because it's impossible to translate a code point that high to Latin 1.

      No, that first snippet in the OP is not trying to convert Unicode U+263A to Latin-1; the wide-character warning is simply telling you that you have a string that perl is treating as utf8 data, and it's being printed to a file handle that has not been set up for that. You're right that it would be impossible to translate that code point to Latin-1, but since the snippet is not doing that, it's not an issue here.

      As for your snippet with "$resume", that demonstrates a "feature" of Perl's internal character representation that I was unaware of until now -- thanks for pointing this out. To clarify, the actual byte sequence differs in the two print statements, as follows:

      # first output line:
      72 c3 a9 73 75 6d c3 a9 0a
       r c3 a9  s  u  m c3 a9 nl
      # second output line:
      72 e9 73 75 6d e9 0a
       r e9  s  u  m e9 nl

      So, the "quasi-ambiguous" nature of bytes/characters in the \x80-\xff (U+0080-U+00FF) range seems deeper, subtler, more strange than I expected: for this set, perl's internal representation is still single-byte, not really utf8.

      (Update: Well, on second thought, maybe I'm still not so clear on this myself; the fact that "\xc3\xa9" turns into "\xe9" -- the single-byte Latin-1 é -- because you hit it with the perl-internal "_utf8_on" function and print it to a non-utf8 file handle... that is some heavy-weight voodoo. I'm with Juerd: don't play with _utf8_on -- you should have seen and heeded the warning about that in the Encode docs.)

      Actually, I had already been aware of that (in some sense), but I had not seen its effect on file output. If you put  binmode STDOUT, ":utf8" between the two print statements (to coincide with upgrading the string to utf8), the byte sequences of the two outputs would be identical.

      Now I understand much better what the rationale is behind the "wide character" warnings -- the behavior demonstrated here is a case of character data that ought to be interpreted as utf8 on output (because it has the utf8 flag turned on), but is not being so interpreted (so it comes out as non-utf8 data, i.e. ill-formed/undisplayable).

      So the basis of this trouble is not specifically PerlIO layers, but rather Perl's current internal representation of this byte/character range, and how that interacts with "default" vs. "utf8" file handles.

      It's a difficult, tricky situation... As you demonstrate, leaving STDOUT in its default state throughout causes one kind of problem. But if it were set to ":utf8" before the first print statement, the two outputs would again be different, but in a different way, and the first one would be "wrong":

      # first line (after binmode STDOUT, ":utf8")
      72 c3 83 c2 a9 73 75 6d c3 83 c2 a9 0a
       r c3 83 c2 a9  s  u  m c3 83 c2 a9 nl
      # second line (still ":utf8")
      72 c3 a9 73 75 6d c3 a9 0a
       r c3 a9  s  u  m c3 a9 nl

      I'd like to spit out scalars flagged as UTF8 by default from KinoSearch...

      I've got a reply to that elsewhere in this thread. (I'm full of replies tonight, it seems.)

        So, the "quasi-ambiguous" nature of bytes/characters in the \x80-\xff (U+0080-U+00FF) range seems deeper, subtler, more strange than I expected: for this set, perl's internal representation is still single-byte, not really utf8.

        It may be either single byte or UTF8, depending on your environment (pragmas). This is NO PROBLEM if you properly decode all your input, and encode all your output. This is not a bug, but a feature that is much needed for backwards compatibility with old code.

        But if it were set to ":utf8" before the first print statement, the two outputs would again be different, but in a different way, and the first one would be "wrong":

        Before the "_utf8_on", which I stress is a BAD IDEA, the string is latin-1. It's converted to UTF-8 as the binmode requested: C3 becomes C3 83 and A9 becomes C2 A9, etcetera. With the "_utf8_on" you tell Perl that, no, it's not latin-1, but UTF-8. And since that matches the output encoding, Perl no longer has any need to convert anything.

        In other words, first the string is "rÃ©sumÃ©\n", which when printed is encoded into UTF-8 as [72] [C3 83] [C2 A9] [73] [75] [6D] [C3 83] [C2 A9] [0A]; then someone messes with the internals and all of a sudden the string is "résumé\n", already UTF-8 encoded as [72] [C3 A9] [73] [75] [6D] [C3 A9] [0A]. (Two digits per byte, one bracketed group per character.)

        Juerd # { site => 'juerd.nl', do_not_use => 'spamtrap', perl6_server => 'feather' }

        No, that first snippet in the OP is not trying to convert Unicode U+263A to Latin-1; the wide-character warning is simply telling you that you have a string that perl is treating as utf8 data, and it's being printed to a file handle that has not been set up for that.

        This isn't quite true. AFAICT the wide-character warning is only given when printing a string that can't be converted to Latin-1 on a filehandle that doesn't have an encoding specified. If it can be converted to Latin-1, it is, and there is no warning. If not, the utf8 encoding is output. (Filehandles with an encoding specified will instead give a warning like '"\x{0391}" does not map to iso-8859-15' and output, in this case, the literal 8 characters "\x{0391}".)
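
The rule in this last reply can be checked with a short sketch, again assuming an in-memory handle without an encoding layer behaves like default STDOUT; capture_print is a made-up helper name. One string is flagged as utf8 but fits in Latin-1, the other holds a character with no Latin-1 equivalent.

```perl
use strict;
use warnings;

# Hypothetical helper: print $string to a handle with no encoding layer
# and return (hex dump of the bytes written, number of warnings raised).
sub capture_print {
    my ($string) = @_;
    my $buffer = '';
    my @warnings;
    local $SIG{__WARN__} = sub { push @warnings, @_ };
    open my $fh, '>', \$buffer or die "cannot open in-memory handle: $!";
    print {$fh} $string;
    close $fh or die "cannot close in-memory handle: $!";
    my $hex = join ' ', map { sprintf '%02x', $_ } unpack 'C*', $buffer;
    return ( $hex, scalar @warnings );
}

my $e_acute = "\xE9";
utf8::upgrade($e_acute);     # stored internally as utf8, still the character U+00E9
my $smiley = "\x{263A}";     # no Latin-1 equivalent

my ( $latin_hex, $latin_warnings ) = capture_print($e_acute);
my ( $wide_hex,  $wide_warnings )  = capture_print($smiley);

print "U+00E9: $latin_hex, warnings: $latin_warnings\n";
print "U+263A: $wide_hex, warnings: $wide_warnings\n";
```

The flagged é should come out as the single Latin-1 byte with no warning, which is the same effect seen in the "résumé" sample earlier in the thread.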