Huh? In your snippet, perl is "just printing whatever gets thrown at it", without doing any sort of "translation" on it.

It's trying and failing to convert Unicode code point 0x263a to Latin-1. We see the warning because it's impossible to translate a code point that high to Latin 1.

I thought the example I gave was the easiest to grok, but this is probably better, because the output is actually different.

#!/usr/bin/perl
use strict;
use warnings;
use Encode qw( _utf8_on );

my $resume = "r\xc3\xa9sum\xc3\xa9";
print $resume, "\n";
_utf8_on($resume);
print $resume, "\n";

Conceptually, appending a non-UTF8 string to a UTF8 string is a really bad idea, bordering on stupid. Don't do that. (Why would you want to? What would you hope to accomplish as a result?)

I'd like to spit out scalars flagged as UTF8 by default from KinoSearch. But if I do that, that means anybody who gets that output is going to have to know how to deal with them. I don't want to spend all my time explaining the bottomless intricacies of Unicode handling in Perl to people. It's not that I want to be doing a lot of this concatenation, it's that I know it's going to happen some of the time and I don't want the support burden.
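
For readers who have not run into the concatenation problem, here is a minimal sketch of what goes wrong. It is illustrative only and has nothing to do with KinoSearch's actual API; the variable names are made up, and it assumes the unflagged string holds UTF-8 encoded octets.

```perl
use strict;
use warnings;
use Encode qw( decode encode );

# UTF-8 encoded octets, as they might arrive from a file or socket (flag off).
my $octets = "r\xc3\xa9sum\xc3\xa9";

# The same word as a character string (flag on), as a library might return it.
my $text = decode('UTF-8', $octets);

# Mixing the two: Perl upgrades $octets as if it were Latin-1, so every
# UTF-8 byte becomes a character of its own (\xc3 => U+00C3, \xa9 => U+00A9).
my $mixed = $text . '/' . $octets;

# Decoding the octets first keeps both halves identical.
my $clean = $text . '/' . decode('UTF-8', $octets);

printf "mixed: %d characters\n", length($mixed);
printf "clean: %d characters\n", length($clean);
```

The mixed string is two characters longer than the clean one because each "é" on the right-hand side has turned into two Latin-1 characters.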

--
Marvin Humphrey
Rectangular Research ― http://www.rectangular.com

Re^3: Interventionist Unicode Behaviors
by Juerd (Abbot) on Sep 08, 2006 at 09:47 UTC

    It's trying and failing to convert Unicode code point 0x263a to Latin-1.

    No, it is not.

    You asked for the code points E2, 98 and BA, and you got them. You then manually messed around with the UTF8 flag. Because of your environment, Perl encoded the three-character string as latin-1, so the bytes were E2 98 BA, and so you are lucky. Then you set the UTF8 flag on, and finally you have that code point 263a, but you didn't get it the way you should have. When you print this string, however, there's no conversion going on AT ALL, because you never specified what to convert TO!

    Perl has no choice but to dump its internal representation to STDOUT, but is friendly enough to warn you that this output may not be what you want, because it doesn't know what you want.

    We see the warning because it's impossible to translate a code point that high to Latin 1.

    No, we see the warning because you're printing something that has the UTF8 flag set (and thus with certainty is a text string), to a filehandle that doesn't have an encoding attached to it.

    I don't want to spend all my time explaining the bottomless intricacies of Unicode handling in Perl to people.

    Neither do we, but apparently you INSIST that you use the internals directly instead of the way things were intended, so we have to explain to you these bottomless intricacies of Unicode handling in Perl's internals if you're ever to understand what the heck your broken code really does.

    Juerd # { site => 'juerd.nl', do_not_use => 'spamtrap', perl6_server => 'feather' }

      You asked for the code points E2, 98 and BA, and you got them. You then manually messed around with the UTF8 flag. Because of your environment, Perl encoded the three-character string as latin-1, so the bytes were E2 98 BA, and so you are lucky.

      Please interpret that sequence of commands as a round-about way of specifying that I wanted a particular sequence of bytes in the scalar's PV. I could also have written it like this:

      use Inline C => <<'END_C';
      SV *
      get_smiley() {
          SV *const smiley_sv = newSV(3);
          SvPOK_on(smiley_sv);
          SvUTF8_on(smiley_sv);
          unsigned char *ptr = (unsigned char*)SvPVX(smiley_sv);
          *ptr++ = 0xE2;
          *ptr++ = 0x98;
          *ptr++ = 0xBA;
          *ptr = 0x00;
          SvCUR_set(smiley_sv, 3);
          return smiley_sv;
      }
      END_C
      my $smiley = get_smiley();

      No, we see the warning because you're printing something that has the UTF8 flag set (and thus with certainty is a text string), to a filehandle that doesn't have an encoding attached to it.

      Please refer to the message with the "résumé" sample. In that sample, a scalar with the UTF8 flag set is printed to a filehandle that has not had an encoding explicitly attached to it. No warning occurs.

      we have to explain to you these bottomless intricacies of Unicode handling in Perl's internals if you're ever to understand what the heck your broken code really does.

      Are you implying that I broke that code accidentally? ;)

      --
      Marvin Humphrey
      Rectangular Research ― http://www.rectangular.com

        Please interpret that sequence of commands as a round-about way of specifying that I wanted a particular sequence of bytes in the scalar's PV. I could also have written it like this: (inline C code)

        That would have been a more correct way, although it's even more round-about. The typical way of requesting a particular sequence of bytes is:

        pack "C*", 0xE2, 0x98, 0xBA.
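
As a small sketch of that approach (the variable names are only for illustration): build the three octets with pack, then let Encode's decode turn them into a character string instead of flipping the flag by hand.

```perl
use strict;
use warnings;
use Encode qw( decode );

# Three raw octets; pack "C*" yields a byte string with the UTF8 flag off.
my $octets = pack "C*", 0xE2, 0x98, 0xBA;

# Decoding interprets those octets as UTF-8 and returns a character string.
my $smiley = decode('UTF-8', $octets);

printf "octets: %d bytes, UTF8 flag %s\n",
    length($octets), utf8::is_utf8($octets) ? 'on' : 'off';
printf "smiley: %d character, U+%04X\n", length($smiley), ord($smiley);
```

Unlike setting the flag directly, decode validates the octets, so a malformed sequence does not end up silently marked as text.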

        Are you implying that I broke that code accidentally? ;)

        Why would you break anything on purpose, and not mention that you did? That's a terrible waste of other people's time.

        Juerd # { site => 'juerd.nl', do_not_use => 'spamtrap', perl6_server => 'feather' }

Re^3: Interventionist Unicode Behaviors
by graff (Chancellor) on Sep 08, 2006 at 10:25 UTC
    It's trying and failing to convert Unicode code point 0x263a to Latin-1. We see the warning because it's impossible to translate a code point that high to Latin 1.
    No, that first snippet in the OP is not trying to convert Unicode U263A to Latin-1; the wide-character warning is simply telling you that you have a string that perl is treating as utf8 data, and it's being printed to a file handle that has not been set up for that. You're right that it would be impossible to translate that code point to Latin-1, but since the snippet is not doing that, it's not an issue here.

    As for your snippet with "$resume", that demonstrates a "feature" of Perl's internal character representation that I was unaware of until now -- thanks for pointing this out. To clarify, the actual byte sequence differs in the two print statements, as follows:

    # first output line:
    72 c3 a9 73 75 6d c3 a9 0a
    r  c3 a9 s  u  m  c3 a9 nl
    # second output line:
    72 e9 73 75 6d e9 0a
    r  e9 s  u  m  e9 nl

    So, the "quasi-ambiguous" nature of bytes/characters in the \x80-\xff (U0080-U00FF) range seems deeper, subtler, more strange than I expected: for this set, perl's internal representation is still single-byte, not really utf8.

    (Update: Well, on second thought, maybe I'm still not so clear on this myself; the fact that "\xc3\xa9" turns into "\xe9" -- the single-byte Latin-1 é -- because you hit it with the perl-internal "_utf8_on" function and print it to a non-utf8 file handle... that is some heavy-weight voodoo. I'm with Juerd: don't play with _utf8_on -- you should have seen and heeded the warning about that in the Encode docs.)

    Actually, I had already been aware of that (in some sense), but I had not seen its effect on file output. If you put binmode STDOUT, ":utf8" between the two print statements (to coincide with upgrading the string to utf8), the byte sequences of the two outputs would be identical.

    Now I understand much better what the rationale is behind the "wide character" warnings -- the behavior demonstrated here is a case of character data that ought to be interpreted as utf8 on output (because it has the utf8 flag turned on), but is not being so interpreted (so it comes out as non-utf8 data, i.e. ill-formed/undisplayable).

    So the basis of this trouble is not specifically PerlIO layers, but rather Perl's current internal representation of this byte/character range, and how that interacts with "default" vs. "utf8" file handles.

    It's a difficult, tricky situation... As you demonstrate, leaving STDOUT in its default state throughout causes one kind of problem. But if it were set to ":utf8" before the first print statement, the two outputs would again be different, but in a different way, and the first one would be "wrong":

    # first line (after binmode STDOUT, ":utf8")
    72 c3 83 c2 a9 73 75 6d c3 83 c2 a9 0a
    r  c3 83 c2 a9 s  u  m  c3 83 c2 a9 nl
    # second line (still ":utf8")
    72 c3 a9 73 75 6d c3 a9 0a
    r  c3 a9 s  u  m  c3 a9 nl

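
The four byte sequences above can be reproduced without touching STDOUT. The sketch below is not from the thread: hex_output is a hypothetical helper that prints to an in-memory filehandle, and the flagged string is produced with decode rather than _utf8_on, which for this input should arrive at the same character string.

```perl
use strict;
use warnings;
use Encode qw( decode );

# Print $string to an in-memory filehandle opened with $mode and
# return the bytes that were written, as space-separated hex.
sub hex_output {
    my ($mode, $string) = @_;
    my $buffer = '';
    open my $fh, $mode, \$buffer or die "Cannot open in-memory handle: $!";
    print {$fh} $string;
    close $fh or die "Cannot close in-memory handle: $!";
    return join ' ', map { sprintf '%02x', ord } split //, $buffer;
}

my $bytes = "r\xc3\xa9sum\xc3\xa9\n";    # UTF8 flag off, 9 octets
my $chars = decode('UTF-8', $bytes);     # UTF8 flag on, 7 characters

# '>' is a default handle, '>:utf8' is a handle with the :utf8 layer.
for my $mode ('>', '>:utf8') {
    print "$mode flag off: ", hex_output($mode, $bytes), "\n";
    print "$mode flag on:  ", hex_output($mode, $chars), "\n";
}
```

With the default handle the flagged string comes out as single-byte e9; with the :utf8 layer the unflagged string is treated as Latin-1 and each of its bytes is re-encoded.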
    I'd like to spit out scalars flagged as UTF8 by default from KinoSearch...

    I've got a reply to that elsewhere in this thread. (I'm full of replies tonight, it seems.)

      So, the "quasi-ambiguous" nature of bytes/characters in the \x80-\xff (U0080-U00FF) range seems deeper, subtler, more strange than I expected: for this set, perl's internal representation is still single-byte, not really utf8.

      It may be either single byte or UTF8, depending on your environment (pragmas). This is NO PROBLEM if you properly decode all your input, and encode all your output. This is not a bug, but a feature that is much needed for backwards compatibility with old code.
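
A minimal sketch of "decode all input, encode all output", assuming the data is UTF-8 on both sides. In-memory filehandles stand in for real files here, and the variable names are only illustrative.

```perl
use strict;
use warnings;

# Octets as they would sit in a UTF-8 encoded file.
my $input_octets = "r\xc3\xa9sum\xc3\xa9\n";

# Decode on the way in: the layer turns octets into characters.
open my $in, '<:encoding(UTF-8)', \$input_octets
    or die "Cannot open input: $!";
my $line = <$in>;
close $in;

# Inside the program, work on characters only.
my $shouted = uc $line;

# Encode on the way out: the layer turns characters back into octets.
my $output_octets = '';
open my $out, '>:encoding(UTF-8)', \$output_octets
    or die "Cannot open output: $!";
print {$out} $shouted;
close $out or die "Cannot close output: $!";

printf "read %d characters, wrote %d octets\n",
    length($line), length($output_octets);
```

Because the string is decoded before uc sees it, the accented letters are upper-cased as characters, and the output layer takes care of producing valid UTF-8 again.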

      But if it were set to ":utf8" before the first print statement, the two outputs would again be different, but in a different way, and the first one would be "wrong":

      Before the "_utf8_on", which I stress is a BAD IDEA, the string is latin-1. It's converted to UTF-8 as the binmode requested: C3 becomes C3 83 and A9 becomes C2 A9, etcetera. With the "_utf8_on" you tell Perl that, no, it's not latin-1, but UTF-8. And since that matches the output encoding, Perl no longer has any need to convert anything.

      In other words, first the string is "rÃ©sumÃ©\n", which when printed is encoded into UTF-8 as 72 C3 83 C2 A9 73 75 6D C3 83 C2 A9 0A; then someone messes with the internals and all of a sudden the string is "résumé\n", already UTF-8 encoded as 72 C3 A9 73 75 6D C3 A9 0A. (Two digits per byte.)

      Juerd # { site => 'juerd.nl', do_not_use => 'spamtrap', perl6_server => 'feather' }

        This is NO PROBLEM if you properly decode all your input, and encode all your output.
        And here is where the Unicode-istas go wrong. Every single piece of software on this 'ere machine - and indeed all the machines I use regularly - was packaged well after Unicode became fashionable. In fact, a great deal of it has either been written from scratch or at least received patches, often large ones, since Unicode became fashionable. And yet Unicode doesn't "Just Work". It should, and requiring me to dick about just so I can see non-ASCII characters reliably is a bug.
      No, that first snippet in the OP is not trying to convert Unicode U263A to Latin-1; the wide-character warning is simply telling you that you have a string that perl is treating as utf8 data, and it's being printed to a file handle that has not been set up for that.
      This isn't quite true. AFAICT the wide-character warning is only given when printing a string that can't be converted to Latin1 on a filehandle that doesn't have an encoding specified. If it can be converted to Latin1, it is, and there is no warning. If not, the utf8 encoding is output. (Filehandles with encoding specified will give a warning like "\x{0391}" does not map to iso-8859-15. and output (in this case) the literal 8 characters "\x{0391}".)
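
A rough way to check this on a filehandle with no encoding layer, sketched with an in-memory handle and a warning hook. The helper name hex_output_default is made up for the example; utf8::upgrade is used only to make sure the first string carries the UTF8 flag.

```perl
use strict;
use warnings;

# Collect warnings instead of letting them go to STDERR.
my @warnings;
local $SIG{__WARN__} = sub { push @warnings, $_[0] };

# Print to a layer-less in-memory handle and return the bytes written as hex.
sub hex_output_default {
    my ($string) = @_;
    my $buffer = '';
    open my $fh, '>', \$buffer or die "Cannot open in-memory handle: $!";
    print {$fh} $string;
    close $fh or die "Cannot close in-memory handle: $!";
    return join ' ', map { sprintf '%02x', ord } split //, $buffer;
}

my $latin1_range = "\x{e9}";
utf8::upgrade($latin1_range);    # UTF8 flag on, but still fits in Latin-1
my $wide = "\x{263a}";           # cannot be represented in Latin-1

my $first             = hex_output_default($latin1_range);
my $count_after_first = scalar @warnings;
my $second            = hex_output_default($wide);

print "flagged e9:  $first (warnings so far: $count_after_first)\n";
print "U+263A:      $second (warnings so far: ", scalar @warnings, ")\n";
```

The flagged but Latin-1-representable string is written as a single byte with no warning; only the code point above 0xFF triggers "Wide character" and falls back to the UTF-8 bytes.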