in reply to Re^2: Interventionist Unicode Behaviors
in thread Interventionist Unicode Behaviors

It's trying and failing to convert Unicode code point 0x263a to Latin-1.

No, it is not.

You asked for the code points E2, 98 and BA, and you got them. You then manually messed around with the UTF8 flag. Because of your environment, Perl encoded the three-character string as latin-1, so the bytes were E2 98 BA, and so you are lucky. Then you set the UTF8 flag on, and finally you have that code point 263a, but you didn't get it the way you should have. When you print this string, however, there's no conversion going on AT ALL, because you never specified what to convert TO!

Perl has no choice but to dump its internal representation to STDOUT, but is friendly enough to warn you that this output may not be what you want, because it doesn't know what you want.
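
For illustration, here is a minimal sketch of what "specifying what to convert TO" can look like: attaching an encoding layer to the filehandle. It prints to an in-memory filehandle rather than STDOUT so the resulting bytes can be inspected; the variable names are my own and are not taken from the code in the original post.

```perl
use strict;
use warnings;

# A text string holding the single character U+263A.
my $smiley = "\x{263A}";

# Print to an in-memory filehandle that has an explicit UTF-8 layer,
# so Perl knows which encoding the text should be converted to.
my $buffer = '';
open my $fh, '>:encoding(UTF-8)', \$buffer or die "open failed: $!";
print {$fh} $smiley;
close $fh or die "close failed: $!";

# Show the bytes that actually reached the handle, in hex.
my $hex = join ' ', map { sprintf '%02X', ord } split //, $buffer;
print "$hex\n";
```

For STDOUT the equivalent would presumably be a call such as binmode(STDOUT, ':encoding(UTF-8)') before printing.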

We see the warning because it's impossible to translate a code point that high to Latin 1.

No, we see the warning because you're printing something that has the UTF8 flag set (and thus with certainty is a text string), to a filehandle that doesn't have an encoding attached to it.

I don't want to spend all my time explaining the bottomless intricacies of Unicode handling in Perl to people.

Neither do we, but apparently you INSIST on using the internals directly instead of the way things were intended, so we have to explain to you these bottomless intricacies of Unicode handling in Perl's internals if you're ever to understand what the heck your broken code really does.

Juerd # { site => 'juerd.nl', do_not_use => 'spamtrap', perl6_server => 'feather' }

Re^4: Interventionist Unicode Behaviors
by creamygoodness (Curate) on Sep 08, 2006 at 12:29 UTC
    You asked for the code points E2, 98 and BA, and you got them. You then manually messed around with the UTF8 flag. Because of your environment, Perl encoded the three-character string as latin-1, so the bytes were E2 98 BA, and so you are lucky.

    Please interpret that sequence of commands as a round-about way of specifying that I wanted a particular sequence of bytes in the scalar's PV. I could also have written it like this:

    use Inline C => <<'END_C';
    SV *
    get_smiley() {
        SV *const smiley_sv = newSV(3);
        SvPOK_on(smiley_sv);
        SvUTF8_on(smiley_sv);
        unsigned char *ptr = (unsigned char*)SvPVX(smiley_sv);
        *ptr++ = 0xE2;
        *ptr++ = 0x98;
        *ptr++ = 0xBA;
        *ptr = 0x00;
        SvCUR_set(smiley_sv, 3);
        return smiley_sv;
    }
    END_C
    my $smiley = get_smiley();

    No, we see the warning because you're printing something that has the UTF8 flag set (and thus with certainty is a text string), to a filehandle that doesn't have an encoding attached to it.

    Please refer to the message with the "résumé" sample. In that sample, a scalar with the UTF8 flag set is printed to a filehandle that has not had an encoding explicitly attached to it. No warning occurs.

    we have to explain to you these bottomless intricacies of Unicode handling in Perl's internals if you're ever to understand what the heck your broken code really does.

    Are you implying that I broke that code accidentally? ;)

    --
    Marvin Humphrey
    Rectangular Research ― http://www.rectangular.com

      Please interpret that sequence of commands as a round-about way of specifying that I wanted a particular sequence of bytes in the scalar's PV. I could also have written it like this: (inline C code)

      That would have been a more correct way, although it's even more round-about. The typical way of requesting a particular sequence of bytes is:

      pack "C*", 0xE2, 0x98, 0xBA.
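
A minimal sketch of how that might be used, assuming the goal is to end up with the single character U+263A: build the byte string with pack, then decode it explicitly with the core Encode module instead of touching the UTF8 flag by hand. The variable names are illustrative only.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Build a byte string from three octets; the UTF8 flag is off.
my $bytes = pack 'C*', 0xE2, 0x98, 0xBA;
printf "bytes: length %d, utf8 flag %s\n",
    length($bytes), utf8::is_utf8($bytes) ? 'on' : 'off';

# Decode the octets as UTF-8 to get a one-character text string.
my $text = decode('UTF-8', $bytes);
printf "text:  length %d, code point U+%04X\n",
    length($text), ord($text);
```

The byte string and the text string stay separate values here, which makes it easier to see which one is being printed later.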

      Are you implying that I broke that code accidentally? ;)

      Why would you break anything on purpose, and not mention that you did? That's a terrible waste of other people's time.

      Juerd # { site => 'juerd.nl', do_not_use => 'spamtrap', perl6_server => 'feather' }

        Are you implying that I broke that code accidentally? ;)
        Why would you break anything on purpose, and not mention that you did? That's a terrible waste of other people's time.

        Juerd, I'm trying to keep things friendly, here. It's my style to fight fire with water. Hence the smiley and the oblique remark as a response to some rather nasty comments, even though something stronger might have been warranted.

        The code in the OP is "broken" in the sense that it triggers a warning. That was intentional. It's broken -- on purpose -- because the whole point of that snippet is to trigger the warning.

        You would also argue that it is broken because of the way that I constructed the example scalars. For me, how those scalars were constructed is a peripheral issue. For you that issue appears to be central. My code works fine as it is, and so I disagree: it is not "broken" in the way you assert. Nevertheless, in the future, I will adopt the pack technique you advocate for constructing binary strings, and I thank you and demerphq for bringing it to my attention.

        In the meantime, I would appreciate it if we could lower the temperature of this discussion. Nobody's perfect. You are obviously quite knowledgeable about Unicode and Perl (as I knew when I cited your tutorial), yet you have said things in this thread which are demonstrably wrong[1], and in the very post where you scold me for not knowing what the heck my broken code does. We're all here to learn, and I'm grateful for your more thoughtful posts. Hopefully we can continue to learn from each other in the future.

        [1] "No, we see the warning because you're printing something that has the UTF8 flag set (and thus with certainty is a text string), to a filehandle that doesn't have an encoding attached to it." If that were true, then this code would issue a warning:

        #!/usr/bin/perl
        use strict;
        use warnings;
        use Devel::Peek;
        use charnames ':full';

        my $thorn = "\N{LATIN CAPITAL LETTER THORN}";
        Dump $thorn;
        print $thorn;
        print "\n";

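
The same comparison can be made without reading terminal output. The following is a rough sketch, not part of the original snippet: it prints to an in-memory filehandle with no encoding layer and collects warnings through a local `$SIG{__WARN__}` handler. The helper name `warnings_when_printing` is made up for this example, and `utf8::upgrade` is used only to make sure the thorn string carries the UTF8 flag.

```perl
use strict;
use warnings;

# Print a string to an in-memory handle that has no encoding layer
# and return any warnings raised while doing so.
sub warnings_when_printing {
    my ($string) = @_;
    my @warnings;
    local $SIG{__WARN__} = sub { push @warnings, $_[0] };
    my $buffer = '';
    open my $fh, '>', \$buffer or die "open failed: $!";
    print {$fh} $string;
    close $fh or die "close failed: $!";
    return @warnings;
}

# U+00DE fits in a single byte; force the UTF8 flag on for the test.
my $thorn = "\x{DE}";
utf8::upgrade($thorn);

# U+263A is above 0xFF, so the UTF8 flag is on already.
my $smiley = "\x{263A}";

my @thorn_warnings  = warnings_when_printing($thorn);
my @smiley_warnings = warnings_when_printing($smiley);

printf "thorn:  flag %s, %d warning(s)\n",
    utf8::is_utf8($thorn) ? 'on' : 'off', scalar @thorn_warnings;
printf "smiley: flag %s, %d warning(s)\n",
    utf8::is_utf8($smiley) ? 'on' : 'off', scalar @smiley_warnings;
```

On the perls I am aware of, only the smiley should produce a "Wide character" warning, even though both strings have the flag set.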
        --
        Marvin Humphrey
        Rectangular Research ― http://www.rectangular.com