in reply to Interventionist Unicode Behaviors

What's the rationale behind some of Perl's aggressively "helpful" UTF-8 handling features?

The rationale is not what you think it is, but more importantly: Perl's UTF-8 handling is not what you think it is. From your code, I can only assume that you are trying to apply your character encoding knowledge to Perl without first learning about Perl's way of handling text strings.

Perl has two kinds of strings, but you can't see what kind of string a given string is. You have to keep track of this yourself.

The first kind is the default kind: the binary string. The second kind is: the text string. Please note that text strings do not have any encoding! (Though internally, it's utf-8)

#!/usr/bin/pe
1. my $smiley = "\xE2\x98\xBA";
2. print $smiley . "\n";
3. _utf8_on($smiley);
4. print $smiley . "\n";

  1. You assign three characters to $smiley. Internally, this may be encoded as latin-1 (three bytes) or as utf-8 (six bytes). Either way, they are three different characters, not one character consisting of three bytes!!
  2. You print the string, but there is no output encoding set on STDOUT, and you don't encode it explicitly. Perl can't know what to do, so it just outputs the bytes as they exist in the string. If it happened to be stored as latin-1 before, then the output will be three bytes. If it happened to be stored as utf-8 before, then the output will be six bytes.
  3. RED FLAG. Here you manually switch on the internal UTF8 flag. You should NEVER do this, unless you know all the details of Perl's UTF8 handling. If the string happened to be stored as latin-1 before, you're lucky because this sequence of bytes happens to also be valid utf-8: you get one smiley. If the string happened to be stored as utf-8 before, nothing happens because the UTF8 flag was already set.
  4. You print the string again. While before you might have had a byte encoding, you now certainly have a wide character in your string, and Perl warns you that you're doing something stupid: you're printing a string that has a wide character in it, so you should have encoded it explicitly.
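
The same distinction can be sketched without touching the flag at all. This is only a minimal illustration, assuming nothing beyond the core Encode module; the variable names are mine. decode() interprets the three bytes as UTF-8 and yields one character, and encode() goes back the other way.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Three characters (U+00E2, U+0098, U+00BA), used here as a byte string.
my $bytes = "\xE2\x98\xBA";

# Interpret those bytes as UTF-8: the result is a text string of one character.
my $text = decode("UTF-8", $bytes);

# Going the other way gives back the original three bytes.
my $round_trip = encode("UTF-8", $text);

# The report itself is plain ASCII, so no output layer is needed here.
printf "%d characters in \$bytes, %d in \$text (U+%04X)\n",
    length($bytes), length($text), ord($text);
```

Note that length() counts characters in both cases; it only looks like it counts bytes for $bytes because each of its characters is below 0x100.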

As you can see, two very different things can happen to $smiley. This may seem useless and totally wrong, because it makes Perl unpredictable. However, if you code the way the Perl gods intended, things go right, and Unicode support in Perl proves to be incredibly helpful and not all that weird anymore.

First of all, your string "\xE2\x98\xBA" is wrong if you want a smiley. These three characters are:

U+00E2 LATIN SMALL LETTER A WITH CIRCUMFLEX
U+0098 START OF STRING
U+00BA MASCULINE ORDINAL INDICATOR

If you wanted the smiley face, you should have asked for it!

U+263A WHITE SMILING FACE

There are several ways of doing this:

# The \x{...} notation
my $smiley = "\x{263A}";

# The \N{...} notation
use charnames ':full';
my $smiley = "\N{WHITE SMILING FACE}";

# A literal UTF-8 encoded smiley face in your code
use utf8;
my $smiley = "☺";
Secondly, when you print something, you should let Perl know in which encoding you would like your data. This can again be done in several ways:

# Explicit encode()
use Encode qw(encode);
print encode("UTF-8", $smiley);

# Set the filehandle to the encoding UTF-8
binmode STDOUT, ":encoding(UTF-8)";
print $smiley;

# Set the filehandle to :utf8, a shortcut syntax because you need it so often
binmode STDOUT, ":utf8";
print $smiley;

The thing to remember is that Perl handles encodings (like UTF-8) for you, and that you shouldn't do it yourself. You encoded your smiley face as UTF-8 yourself, and then gave Perl the three individual bytes. That's possible, but it's much better in many ways to let Perl handle this for you.

Is there any reason why STDOUT shouldn't just print whatever gets thrown at it? How about, if I want it to perform automatic translation, I turn that feature on?

It's funny that you describe exactly what Perl does. It prints whatever it gets, whether that data makes sense or not. If you want automatic translation (You do!!), you have to turn that feature on.

Another "feature" that's bitten me in the butt is the silent "upgrading" (i.e. corruption) of non-UTF8 scalars when concatenated with UTF8 scalars...

It bites only if you do something wrong, or have your environment set up badly. In this case, you're doing something wrong. You see, a text string should always be encode()d explicitly before you concatenate it to any non-text string. The packed num is not a text string, but a sequence of bytes. If you need both these strings in one stream, obviously it's a binary stream, and you need to encode the text string using the required encoding. For example:

use utf8;
use Socket qw(inet_aton);
use Encode qw(encode);

my $text_string = "Héllø wõrld!";
my $binary_string = inet_aton("127.0.0.1");
my $data_to_send;

# Now, we need to send both in one go!
# But we can't do that directly, because $text_string needs to be encoded first.
# How shall we encode it?

# As UTF-8?
$data_to_send = encode("UTF-8", $text_string) . $binary_string;
print $data_to_send;

# As ISO-8859-1?
$data_to_send = encode("ISO-8859-1", $text_string) . $binary_string;
print $data_to_send;

# As KOI8-R?
$data_to_send = encode("KOI8-R", $text_string) . $binary_string;
print $data_to_send;
# Oops, these characters don't exist in KOI8-R, so Perl used question marks. Heehee :)
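
For completeness, here is a rough sketch of the receiving end of such a stream. The layout is an assumption I made up for illustration (UTF-8 text followed by exactly four address bytes), and the $received_* names are hypothetical. The text is written with \x{...} escapes so the example doesn't depend on how the file is saved.

```perl
use strict;
use warnings;
use Socket qw(inet_aton inet_ntoa);
use Encode qw(encode decode);

# Text string: the same greeting, written with escapes instead of "use utf8".
my $text_string   = "H\x{E9}ll\x{F8} w\x{F5}rld!";
# Binary string: four packed address bytes.
my $binary_string = inet_aton("127.0.0.1");

# Sender: encode the text explicitly, then concatenate bytes with bytes.
my $data_to_send = encode("UTF-8", $text_string) . $binary_string;

# Receiver (assumed layout): the last four bytes are the address,
# everything before them is UTF-8 encoded text.
my $address_bytes = substr($data_to_send, -4);
my $text_bytes    = substr($data_to_send, 0, -4);

my $received_text = decode("UTF-8", $text_bytes);   # text string again
my $received_addr = inet_ntoa($address_bytes);      # dotted-quad string

print "text survived the round trip\n" if $received_text eq $text_string;
print "address: $received_addr\n";
```

The point is the symmetry: encode() on the way into the binary stream, decode() on the way out, and the two kinds of string never get concatenated directly.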

Think there's any chance these behaviors could change in Perl 5.10? Is it worth bringing up on p5p?

No chance this will change, because it's a solid system. I'm sure that once you understand how it works, you will also be able to use it, and maybe even love it.

Perl 6, on the other hand, doesn't have to be compatible with legacy code, and it has a very nice type system. The two combined mean that Perl 6 can use two different string types, Buf and Str. They're very simple, and Perl will scream if you ever try to combine the two in concatenation.

my Buf $byte_string;
my Str $text_string;

Juerd # { site => 'juerd.nl', do_not_use => 'spamtrap', perl6_server => 'feather' }

Re: Never touch (or look at) the UTF8 flag!!
by creamygoodness (Curate) on Sep 08, 2006 at 10:58 UTC

    Juerd,

    RED FLAG. Here you manually switch on the internal UTF8 flag. You should NEVER do this, unless you know all the details of Perl's UTF8 handling. If the string happened to be stored as latin-1 before, you're lucky because this sequence of bytes happens to also be valid utf-8: you get one smiley. If the string happened to be stored as utf-8 before, nothing happens because the UTF8 flag was already set.

    Luck had nothing to do with it. :) It was three bytes in the string because I specified those exact bytes. Then I turned on the UTF8 flag and got exactly what I wanted: a Perl scalar with a PV of 0xE2 0x98 0xBA 0x00, a LEN of 4, a CUR of 3, with the SVf_UTF8 flag set, yada yada. I chose not to represent the string as "\x{263a}" or "\N{WHITE SMILING FACE}" because in both of those cases the SVf_UTF8 flag would have been set -- whereas by using raw hex notation, I coerced Perl into parsing the string using byte semantics.

    #!/usr/bin/perl
    use strict;
    use warnings;
    use Devel::Peek;

    my $bytes = "\xE2\x98\xBA";
    my $uni = "\x{263a}";

    # Only one difference between these: the UTF8 flag is on for $uni
    Dump($bytes);
    Dump($uni);

    __END__

    Outputs:

    SV = PV(0x1801660) at 0x180b584
      REFCNT = 1
      FLAGS = (PADBUSY,PADMY,POK,pPOK)
      PV = 0x300bd0 "\342\230\272"\0
      CUR = 3
      LEN = 4
    SV = PV(0x1801678) at 0x180b560
      REFCNT = 1
      FLAGS = (PADBUSY,PADMY,POK,pPOK,UTF8)
      PV = 0x316c80 "\342\230\272"\0 [UTF8 "\x{263a}"]
      CUR = 3
      LEN = 4

    I'm personally quite comfortable turning the UTF8 flag on and off, as I'm confident that I know when a sequence of bytes is valid UTF-8 and when it's not. I do a lot of XS programming, and I've written a fair amount of low-level unicode processing code. See this patch of mine for Apache Lucene that fixes the IO classes so that they read/write legal UTF-8 (which they'd claimed they were using in their specs) rather than the Modified UTF-8 they'd actually been using.

    However switching SVf_UTF8 on and off is not something I do lightly, or that I would recommend to the casual user, so there we are in agreement.

    I'm sure that once you understand how it works, you will also be able to use it, and maybe even love it.

    The basic system is not mysterious. SVf_UTF8 is either on or it isn't. (and if it's on, it better be right :).

    Perl 6 can use two different string types, Buf and Str.
    If Str is limited to Unicode and only Unicode, that's Nirvana...
    --
    Marvin Humphrey
    Rectangular Research ― http://www.rectangular.com

      # Only one difference between these: the UTF8 flag is on for $uni

      As far as I understand it, this need not be true. You mention that you do a lot of XS programming. It seems to me that you are equating the XS concept of altering the PV with the Perl concept of assigning to a string. With byte semantics that is correct, but with utf8 semantics it is not. I don't believe there is any guarantee that this will always be true. For instance, Perl 5.12 could be entirely Unicode internally, and your program would break. Likewise, if the string were utf8-on before you did the assignment, the result would be different. If you want to operate on the level you seem to, I think you should use pack.

      Also a little nit: perl doesn't do UTF-8, it does utf8, which is subtly different from true UTF-8. Although in the context here I don't think it matters.

      ---
      $world=~s/war/peace/g

        Can Perl 5.12 start interpreting string literals with utf8 semantics without shattering backwards compatibility? I didn't think that was likely to happen. If it does it won't just be my examples here that break! :)

        I tried to manufacture an example to demo your assertion that if the string were utf8-on prior to assignment the result would differ, but I haven't been able to come up with anything. Certainly concatenating onto the end of an existing string the results would differ, but wholesale assignment?? How would you change the following example to illustrate your point?

        #!/usr/bin/perl
        use strict;
        use warnings;
        use Encode qw( _utf8_on );
        use Devel::Peek qw( Dump );

        my $foo;
        $foo = 'foo';
        Dump($foo);

        my $bar;
        _utf8_on($bar);
        $bar = 'bar';
        Dump($bar);

        __END__

        Outputs:

        SV = PV(0x1801660) at 0x180b59c
          REFCNT = 1
          FLAGS = (PADBUSY,PADMY,POK,pPOK)
          PV = 0x365f20 "foo"\0
          CUR = 3
          LEN = 4
        SV = PV(0x1801678) at 0x180b56c
          REFCNT = 1
          FLAGS = (PADBUSY,PADMY,POK,pPOK)
          PV = 0x300bd0 "bar"\0
          CUR = 3
          LEN = 4

        I also have to confess that I don't understand what you mean when you say that altering the PV is not the same as assigning to a string under utf8 semantics. Certainly you'd need to alter the PV (and other values, possibly flags too) following the rules of unicode handling -- for instance, checking for illegal sequences at splice points. Is that what you mean? I've spelunked a lot of the utf8 code in sv.c and I don't recall having seen any constructs I didn't recognize.

        And thanks for the suggestion on pack. Maybe if I'd used that in the OP there'd be less blood on the floor! :)

        --
        Marvin Humphrey
        Rectangular Research ― http://www.rectangular.com

      Luck had nothing to do with it. :) It was three bytes in the string because I specified those exact bytes.

      If this piece of code was in a module, and the main script disagreed about the encoding, things would have broken. For example, if "use encoding 'iso-8859-1';" was used: you'd get a UTF8 string of three characters, but six bytes.

      That's why you should never use \x for literal bytes. Instead, use pack with a "C*" template. Only if nothing in your script uses the new stuff, you can be sure you get the old stuff.
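
      A minimal sketch of that, assuming nothing beyond core Perl and the Encode module (the variable names are only for illustration): pack with "C*" builds the byte string from numeric values, so no string-literal rules are involved, and decode() then states explicitly that those bytes are UTF-8.

      ```perl
      use strict;
      use warnings;
      use Encode qw(decode);

      # Build the three bytes from their numeric values.
      my $bytes = pack("C*", 0xE2, 0x98, 0xBA);

      # Declare explicitly that those bytes are UTF-8 encoded text.
      my $smiley = decode("UTF-8", $bytes);

      # Plain ASCII report, so no output layer is needed.
      printf "%d bytes, %d character (U+%04X)\n",
          length($bytes), length($smiley), ord($smiley);
      ```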

      I'm personally quite comfortable turning the UTF8 flag on and off, as I'm confident that I know when a sequence of bytes is valid UTF-8 and when it's not. I do a lot of XS programming, and I've written a fair amount of low-level unicode processing code. See this patch of mine for Apache Lucene that fixes the IO classes so that they read/write legal UTF-8 (which they'd claimed they were using in their specs) rather than the Modified UTF-8 they'd actually been using.

      For someone who's sufficiently skilled in Perl, unicode, and the combination of both, you managed to appear quite clueless in the OP. But now I wonder if you were actually serious (if so, please rephrase your question, this time based on the way you SHOULD use things), or just trolling.

      Perl 6 can use two different string types, Buf and Str.
      If Str is limited to Unicode and only Unicode, that's Nirvana...

      It is. A Buf can have an :encoding attribute, but a Str is always unicode. That is: unicode, not utf-8: exactly as in Perl 5, you must not care about the internal encoding unless you're actually doing internal things. Perl does unicode for its text strings, not utf-8. That's why you have to decode() and encode() yourself.

      Juerd # { site => 'juerd.nl', do_not_use => 'spamtrap', perl6_server => 'feather' }

        If this piece of code was in a module, and the main script disagreed about the encoding, things would have broken. For example, if "use encoding 'iso-8859-1';" was used: you'd get a UTF8 string of three characters, but six bytes.
        That's why you should never use \x for literal bytes. Instead, use pack with a "C*" template. Only if nothing in your script uses the new stuff, you can be sure you get the old stuff.
        This is really a side issue, because as I've stressed, the hex notation was a means to an end. All I wanted was a scalar with a particular sequence of bytes in the PV, and I'd have been just as happy to have gotten it with pack, as you advocate.

        Nevertheless, I have not yet found a way to make the interpolated backslash-x notation misbehave as you suggest it should. Can you please indicate how to modify this code sample so that it illustrates your assertion?

        slothbear:~/perltest marvin$ cat BackslashX.pm
        package BackslashX;
        use strict;
        use warnings;

        use Encode '_utf8_on';

        our $smiley = "\xE2\x98\xBA";
        _utf8_on($smiley);

        1;
        slothbear:~/perltest marvin$ cat backslash_x.plx
        #!/usr/bin/perl
        use strict;
        use warnings;

        use encoding 'iso-8859-1';

        use BackslashX;
        use Devel::Peek;

        Dump($BackslashX::smiley);
        print $BackslashX::smiley;
        print "\n";
        slothbear:~/perltest marvin$ perl backslash_x.plx
        SV = PV(0x1834224) at 0x181ed98
          REFCNT = 1
          FLAGS = (POK,pPOK,UTF8)
          PV = 0x372010 "\342\230\272"\0 [UTF8 "\x{263a}"]
          CUR = 3
          LEN = 4
        "\x{263a}" does not map to iso-8859-1 at backslash_x.plx line 11.
        \x{263a}

        That's clearly broken, but only because Unicode code point 0x263a doesn't map to Latin-1. How do I get the 6-byte combo?
        For someone who's sufficiently skilled in Perl, unicode, and the combination of both, you managed to appear quite clueless in the OP. But now I wonder if you were actually serious (if so, please rephrase your question, this time based on the way you SHOULD use things), or just trolling.

        Trolling? On the contrary: I'm doing my best to keep this discussion low-key despite some rather provocative remarks about my competence that have gone by. :)

        --
        Marvin Humphrey
        Rectangular Research ― http://www.rectangular.com