What's the rationale behind some of Perl's aggressively "helpful" UTF-8 handling features?

The rationale is not what you think it is, but more importantly: Perl's UTF-8 handling is not what you think it is. From your code, I can only assume that you are trying to apply your character encoding knowledge to Perl without first learning about Perl's way of handling text strings.

Perl has two kinds of strings, but you can't see what kind of string a given string is. You have to keep track of this yourself.

The first kind is the default kind: the binary string. The second kind is: the text string. Please note that text strings do not have any encoding! (Though internally, it's utf-8)

#!/usr/bin/pe
1. my $smiley = "\xE2\x98\xBA";
2. print $smiley . "\n";
3. _utf8_on($smiley);
4. print $smiley . "\n";

  1. You assign three characters to $smiley. Internally, this may be encoded as latin-1 (three bytes) or as utf-8 (six bytes). Either way, they are three different characters, not one character consisting of three bytes!!
  2. You print the string, but there is no output encoding set on STDOUT, and you don't encode it explicitly. Perl can't know what to do, so it just outputs the bytes as they exist in the string. If it happened to be stored as latin-1 before, then the output will be three bytes. If it happened to be stored as utf-8 before, then the output will be six bytes.
  3. RED FLAG. Here you manually switch on the internal UTF8 flag. You should NEVER do this, unless you know all the details of Perl's UTF8 handling. If the string happened to be stored as latin-1 before, you're lucky because this sequence of bytes happens to also be valid utf-8: you get one smiley. If the string happened to be stored as utf-8 before, nothing happens because the UTF8 flag was already set.
  4. You print the string again. While before you might have had a byte encoding, you now certainly have a wide character in your string, and Perl warns you that you're doing something stupid: you're printing a string that has a wide character in it, so you should have encoded it explicitly.
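
To make step 1 concrete, here is a minimal sketch, assuming Perl 5.8 or later with the core Encode module. The variable names are only illustrative. It forces the same three characters into both internal representations with utf8::upgrade and shows that, as far as Perl's string semantics go, nothing has changed:

```perl
use strict;
use warnings;
use Encode qw(encode);

my $latin1   = "\xE2\x98\xBA";   # three characters, all below U+0100
my $upgraded = $latin1;
utf8::upgrade($upgraded);        # same characters, internal storage is now utf-8

# The two strings compare equal and have the same length in characters.
my $same_text  = ($latin1 eq $upgraded) ? 1 : 0;
my $char_count = length($upgraded);

# Explicitly encoding the three characters as UTF-8 gives two bytes each.
my $utf8_length = length(encode("UTF-8", $latin1));

print "equal: $same_text, characters: $char_count, UTF-8 bytes: $utf8_length\n";
```

The internal storage is an implementation detail; what you should reason about is the list of characters.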

As you can see, two very different things can happen to $smiley. This may seem useless and totally wrong, because it makes Perl unpredictable. However, if you code the way the Perl gods intended, things go right, and Unicode support in Perl proves to be incredibly helpful and not all that weird anymore.

First of all, your string "\xE2\x98\xBA" is wrong if you want a smiley. These three characters are:

U+00E2 LATIN SMALL LETTER A WITH CIRCUMFLEX
U+0098 START OF STRING
U+00BA MASCULINE ORDINAL INDICATOR

If you wanted the smiley face, you should have asked for it!

U+263A WHITE SMILING FACE

There are several ways of doing this:

# The \x{...} notation
my $smiley = "\x{263A}";

# The \N{...} notation
use charnames ':full';
my $smiley = "\N{WHITE SMILING FACE}";

# A literal UTF-8 encoded smiley face in your code
use utf8;
my $smiley = "☺";

Secondly, when you print something, you should let Perl know in which encoding you would like your data. This can again be done in several ways:

# Explicit encode()
use Encode qw(encode);
print encode("UTF-8", $smiley);

# Set the filehandle to the encoding UTF-8
binmode STDOUT, ":encoding(UTF-8)";
print $smiley;

# Set the filehandle to :utf8, a shortcut syntax because you need it so often
binmode STDOUT, ":utf8";
print $smiley;

The thing to remember is that Perl does handling of encodings (like UTF-8) for you, and that you shouldn't do it yourself. You encoded your smiley face as UTF-8 yourself, and then let Perl know the three individual bytes. That's a possibility, but it's much better in many ways if you let Perl handle this for you.
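
If you really do start out with the three UTF-8 bytes (read from a file or a socket, say), the way to turn them into a text string is decode(), not the UTF8 flag. A small sketch, assuming the core Encode module; the variable names are mine:

```perl
use strict;
use warnings;
use Encode qw(encode decode);

my $bytes  = "\xE2\x98\xBA";            # binary string: a UTF-8 encoded smiley
my $smiley = decode("UTF-8", $bytes);   # text string: one character

my $char_count = length($smiley);
my $code_point = sprintf "U+%04X", ord($smiley);

# Encoding the text string again gives back the original bytes.
my $round_trip = (encode("UTF-8", $smiley) eq $bytes) ? 1 : 0;

print "$char_count character, $code_point, round trip ok: $round_trip\n";
```

Unlike flipping the flag by hand, decode() checks that the bytes really are UTF-8 and does not depend on how the string happened to be stored.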

Is there any reason why STDOUT shouldn't just print whatever gets thrown at it? How about, if I want it to perform automatic translation, I turn that feature on?

It's funny that you describe exactly what Perl does. It prints whatever it gets, whether that data makes sense or not. If you want automatic translation (You do!!), you have to turn that feature on.
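
Here is a sketch of what "turning that feature on" looks like, using an in-memory filehandle instead of STDOUT so the resulting bytes can be inspected. The in-memory handle and the names $out and $buffer are just for illustration; the encoding layer works the same way on a real filehandle:

```perl
use strict;
use warnings;

my $smiley = "\x{263A}";   # one character: WHITE SMILING FACE

# Open an in-memory filehandle with an explicit output encoding layer.
open(my $out, '>:encoding(UTF-8)', \my $buffer) or die "open failed: $!";
print {$out} $smiley;
close($out) or die "close failed: $!";

# $buffer now holds the encoded bytes; show them in hex.
my @bytes = map { sprintf "%02X", ord } split //, $buffer;
print "@bytes\n";
```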

Another "feature" that's bitten me in the butt is the silent "upgrading" (i.e. corruption) of non-UTF8 scalars when concatenated with UTF8 scalars...

It bites only if you do something wrong, or have your environment set up badly. In this case, you're doing something wrong. You see, a text string should always be encode()d explicitly before you concatenate it with any non-text string. The packed number is not a text string, but a sequence of bytes. If you need both of these strings in one stream, it is obviously a binary stream, and you need to encode the text string using the required encoding. For example:

use utf8;
use Socket qw(inet_aton);
use Encode qw(encode);

my $text_string   = "Héllø wõrld!";
my $binary_string = inet_aton("127.0.0.1");
my $data_to_send;

# Now, we need to send both in one go!
# But we can't do that directly, because $text_string needs to be encoded first.
# How shall we encode it?

# As UTF-8?
$data_to_send = encode("UTF-8", $text_string) . $binary_string;
print $data_to_send;

# As ISO-8859-1?
$data_to_send = encode("ISO-8859-1", $text_string) . $binary_string;
print $data_to_send;

# As KOI8-R?
$data_to_send = encode("KOI8-R", $text_string) . $binary_string;
print $data_to_send;
# Oops, these characters don't exist in KOI8-R, so Perl used question marks. Heehee :)
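
To see why the order matters, compare encoding before concatenation with encoding after it. This is only a sketch; the address 192.168.0.1 is picked because its packed form contains bytes above 0x7F, and the names $right and $wrong are mine:

```perl
use strict;
use warnings;
use Socket qw(inet_aton);
use Encode qw(encode);

my $text   = "\x{263A}";                 # text string: one character
my $packed = inet_aton("192.168.0.1");   # binary string: bytes C0 A8 00 01

# Right: encode the text first, then append the raw bytes.
my $right = encode("UTF-8", $text) . $packed;

# Wrong: concatenate first. The packed bytes are now treated as the
# characters U+00C0, U+00A8, U+0000 and U+0001, and get encoded too.
my $wrong = encode("UTF-8", $text . $packed);

printf "right: %d bytes, wrong: %d bytes\n", length($right), length($wrong);
```

The "wrong" result is two bytes longer because C0 and A8 each became a two-byte UTF-8 sequence. That is the "corruption": Perl did what was asked, but a binary string was handed to it as if it were text.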

Think there's any chance these behaviors could change in Perl 5.10? Is it worth bringing up on p5p?

No chance this will change, because it's a solid system. I'm sure that once you understand how it works, you will also be able to use it, and maybe even love it.

Perl 6, on the other hand, doesn't have to be compatible with legacy code, and it has a very nice type system. Together, these let Perl 6 use two different string types, Buf and Str. They're very simple, and Perl will scream if you ever try to combine the two in concatenation.

my Buf $byte_string;
my Str $text_string;

Juerd # { site => 'juerd.nl', do_not_use => 'spamtrap', perl6_server => 'feather' }


In reply to Never touch (or look at) the UTF8 flag!! by Juerd
in thread Interventionist Unicode Behaviors by creamygoodness
