in reply to Interventionist Unicode Behaviors
What's the rationale behind some of Perl's aggressively "helpful" UTF-8 handling features?
The rationale is not what you think it is, but more importantly: Perl's UTF-8 handling is not what you think it is. From your code, I can only assume that you are trying to apply your character encoding knowledge to Perl without first learning about Perl's way of handling text strings.
Perl has two kinds of strings, but you can't see what kind of string a given string is. You have to keep track of this yourself.
The first kind is the default kind: the binary string. The second kind is: the text string. Please note that text strings do not have any encoding! (Though internally, it's utf-8)
#!/usr/bin/pe
my $smiley = "\xE2\x98\xBA";
print $smiley . "\n";
_utf8_on($smiley);
print $smiley . "\n";
As you can see, two very different things can happen to $smiley. This may seem useless and totally wrong, because it makes Perl unpredictable. However, if you code the way the Perl gods intended, things go right, and Unicode support in Perl proves to be incredibly helpful and not all that weird anymore.
First of all, your string "\xE2\x98\xBA" is wrong if you want a smiley. These three characters are:
U+00E2 LATIN SMALL LETTER A WITH CIRCUMFLEX
U+0098 START OF STRING
U+00BA MASCULINE ORDINAL INDICATOR
If you wanted the smiley face, you should have asked for it!
U+263A WHITE SMILING FACE
There are several ways of doing this:
# The \x{...} notation
my $smiley = "\x{263A}";
# The \N{...} notation
use charnames ':full';
my $smiley = "\N{WHITE SMILING FACE}";
# A literal UTF-8 encoded smiley face in your code
use utf8;
my $smiley = "☺";
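A minimal sketch of the difference, with variable names made up for illustration: the three-escape string really is three characters, while "\x{263A}" is one. It only inspects lengths and code points, so nothing wide is printed.

```perl
use strict;
use warnings;

# Three separate characters: U+00E2, U+0098, U+00BA
my $three_chars = "\xE2\x98\xBA";

# One character: U+263A WHITE SMILING FACE
my $one_char = "\x{263A}";

my $len_three   = length $three_chars;
my $len_one     = length $one_char;
my @code_points = map { sprintf("U+%04X", ord $_) } split //, $three_chars;

printf "%d characters: %s\n", $len_three, join(" ", @code_points);
printf "%d character:  U+%04X\n", $len_one, ord $one_char;
```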
Secondly, when you print something, you should let Perl know in which encoding you would like your data. This can again be done in several ways:
# Explicit encode()
use Encode qw(encode);
print encode("UTF-8", $smiley);
# Set the filehandle to the encoding UTF-8
binmode STDOUT, ":encoding(UTF-8)";
print $smiley;
# Set the filehandle to :utf8, a shortcut syntax because you need it so often
binmode STDOUT, ":utf8";
print $smiley;
The thing to remember is that Perl does handling of encodings (like UTF-8) for you, and that you shouldn't do it yourself. You encoded your smiley face as UTF-8 yourself, and then let Perl know the three individual bytes. That's a possibility, but it's much better in many ways if you let Perl handle this for you.
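Letting Perl handle the encoding usually means decoding bytes when they enter the program and encoding text when it leaves. The sketch below assumes the input arrives as UTF-8 bytes; decode() is the Encode counterpart of the encode() calls above, and the variable names are only illustrative.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Bytes as they might arrive from a file or socket:
# the UTF-8 encoding of U+263A (assumed input for this sketch)
my $bytes_in = "\xE2\x98\xBA";

# Decode at the boundary: binary string in, text string out
my $text = decode("UTF-8", $bytes_in);

# Inside the program, work with characters rather than bytes
my $char_count = length $text;
my $code_point = sprintf("U+%04X", ord $text);

# Encode again before the data leaves the program
my $bytes_out  = encode("UTF-8", $text);
my $byte_count = length $bytes_out;

print "$char_count character ($code_point), $byte_count bytes after encoding\n";
```

At no point does the code need to know or touch the internal UTF8 flag.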
Is there any reason why STDOUT shouldn't just print whatever gets thrown at it? How about, if I want it to perform automatic translation, I turn that feature on?
It's funny that you describe exactly what Perl does. It prints whatever it gets, whether that data makes sense or not. If you want automatic translation (You do!!), you have to turn that feature on.
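To see both behaviours side by side without touching the terminal, here is a small sketch that writes the same text string to an in-memory filehandle, once with no layer and once with :encoding(UTF-8). The helper bytes_written is hypothetical, written only for this illustration.

```perl
use strict;
use warnings;

my $text = "caf\x{E9}";   # 4 characters, the last one is U+00E9

# Write $string to an in-memory filehandle opened with the given layer
# and return the bytes that ended up in the buffer.
sub bytes_written {
    my ($layer, $string) = @_;
    my $buffer = '';
    open(my $fh, ">$layer", \$buffer) or die "Cannot open in-memory handle: $!";
    print {$fh} $string;
    close($fh) or die "Cannot close in-memory handle: $!";
    return $buffer;
}

my $raw     = bytes_written('',                 $text);  # no translation
my $encoded = bytes_written(':encoding(UTF-8)', $text);  # translation turned on

printf "no layer:         %d bytes\n", length $raw;
printf ":encoding(UTF-8): %d bytes\n", length $encoded;
```

Without a layer the handle simply gets one byte per character; with the layer, U+00E9 goes out as its two-byte UTF-8 encoding.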
Another "feature" that's bitten me in the butt is the silent "upgrading" (i.e. corruption) of non-UTF8 scalars when concatenated with UTF8 scalars...
It bites only if you do something wrong, or have your environment set up badly. In this case, you're doing something wrong. You see, a text string should always be encode()d explicitly before you concatenate it to any non-text string. The packed num is not a text string, but a sequence of bytes. If you need both these strings in one stream, obviously it's a binary stream, and you need to encode the text string using the required encoding. For example:
use utf8;
use Socket qw(inet_aton);
use Encode qw(encode);
my $text_string = "Héllø wõrld!";
my $binary_string = inet_aton("127.0.0.1");
my $data_to_send;
# Now, we need to send both in one go!
# But we can't do that directly, because $text_string needs to be encoded first.
# How shall we encode it?
# As UTF-8?
$data_to_send = encode("UTF-8", $text_string) . $binary_string;
print $data_to_send;
# As ISO-8859-1?
$data_to_send = encode("ISO-8859-1", $text_string) . $binary_string;
print $data_to_send;
# As KOI8-R?
$data_to_send = encode("KOI8-R", $text_string) . $binary_string;
print $data_to_send;
# Oops, these characters don't exist in KOI8-R, so Perl used question marks. Heehee :)
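Why the order matters can be shown with a packed number whose bytes are above 0x7F. This is only a sketch: the address 192.168.0.1 and the variable names are assumptions chosen so the damage is visible. Encoding after concatenation stands in for what happens when the mixed string is pushed through an encoding step as if it were all text.

```perl
use strict;
use warnings;
use Encode qw(encode);

my $text_string   = "H\x{E9}llo";            # text: 5 characters
my $binary_string = pack("N", 0xC0A80001);   # 4 bytes: C0 A8 00 01

# Right: encode the text first, then concatenate bytes with bytes
my $good = encode("UTF-8", $text_string) . $binary_string;

# Wrong: concatenate first, then encode. The packed bytes are now
# treated as the characters U+00C0, U+00A8, ... and encoded as well.
my $bad = encode("UTF-8", $text_string . $binary_string);

my $good_len    = length $good;
my $bad_len     = length $bad;
my $good_intact = substr($good, -4) eq $binary_string ? 1 : 0;
my $bad_intact  = substr($bad,  -4) eq $binary_string ? 1 : 0;

print "encode, then concatenate: $good_len bytes, packed number intact: $good_intact\n";
print "concatenate, then encode: $bad_len bytes, packed number intact: $bad_intact\n";
```

In the second case the two high bytes of the packed number each become two bytes, which is the "corruption" being complained about.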
Think there's any chance these behaviors could change in Perl 5.10? Is it worth bringing up on p5p?
No chance this will change, because it's a solid system. I'm sure that once you understand how it works, you will also be able to use it, and maybe even love it.
Perl 6, on the other hand, doesn't have to be compatible with legacy code, and it has a very nice type system. Together, these mean that Perl 6 can use two different string types, Buf and Str. They're very simple, and Perl will scream if you ever try to combine the two in concatenation.
my Buf $byte_string;
my Str $text_string;