in reply to Never touch (or look at) the UTF8 flag!!
in thread Interventionist Unicode Behaviors
Juerd,
RED FLAG. Here you manually switch on the internal UTF8 flag. You should NEVER do this, unless you know all the details of Perl's UTF8 handling. If the string happened to be stored as latin-1 before, you're lucky because this sequence of bytes happens to also be valid utf-8: you get one smiley. If the string happened to be stored as utf-8 before, nothing happens because the UTF8 flag was already set.
Luck had nothing to do with it. :) It was three bytes in the string because
I specified those exact bytes. Then I turned on the UTF8 flag and got exactly
what I wanted: a Perl scalar with a PV of 0xE2 0x98 0xBA 0x00, a LEN of 4, a
CUR of 3, with the SVf_UTF8 flag set, yada yada. I chose not to represent the
string as "\x{263a}" or "\N{WHITE SMILING FACE}" because in both of
those cases the SVf_UTF8 flag would have been set -- whereas by using raw hex
notation, I coerced Perl into parsing the string using byte semantics.
    #!/usr/bin/perl
    use strict;
    use warnings;
    use Devel::Peek;

    my $bytes = "\xE2\x98\xBA";
    my $uni   = "\x{263a}";

    # Only one difference between these: the UTF8 flag is on for $uni
    Dump($bytes);
    Dump($uni);

    __END__

    Outputs:

    SV = PV(0x1801660) at 0x180b584
      REFCNT = 1
      FLAGS = (PADBUSY,PADMY,POK,pPOK)
      PV = 0x300bd0 "\342\230\272"\0
      CUR = 3
      LEN = 4
    SV = PV(0x1801678) at 0x180b560
      REFCNT = 1
      FLAGS = (PADBUSY,PADMY,POK,pPOK,UTF8)
      PV = 0x316c80 "\342\230\272"\0 [UTF8 "\x{263a}"]
      CUR = 3
      LEN = 4
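For anyone who wants to watch the flag flip itself, here is a minimal sketch of what I described above. It assumes the Encode module bundled with Perl 5.8 and later; the variable name $flagged is just for illustration. It is a demonstration, not a recommendation.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode ();

my $bytes = "\xE2\x98\xBA";   # three octets, SVf_UTF8 off
my $uni   = "\x{263a}";       # one character, SVf_UTF8 on

# Copy the octets, then switch the flag on. The buffer is not touched
# and nothing is validated; only Perl's interpretation of it changes.
my $flagged = $bytes;
Encode::_utf8_on($flagged);

for my $pair ([bytes => $bytes], [flagged => $flagged], [uni => $uni]) {
    my ($name, $str) = @$pair;
    printf "%-8s length=%d flag=%s\n",
        $name, length($str), utf8::is_utf8($str) ? "on" : "off";
}

print "flagged and uni compare equal\n" if $flagged eq $uni;
```

With the flag off, length() counts three octets; with it on, the same buffer counts as one character and compares equal to "\x{263a}".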
I'm personally quite comfortable turning the UTF8 flag on and off, as I'm confident that I know when a sequence of bytes is valid UTF-8 and when it's not. I do a lot of XS programming, and I've written a fair amount of low-level Unicode processing code. See this patch of mine for Apache Lucene that fixes the IO classes so that they read/write legal UTF-8 (which they'd claimed they were using in their specs) rather than the Modified UTF-8 they'd actually been using.
However, switching SVf_UTF8 on and off is not something I do lightly, nor something I would recommend to the casual user, so there we are in agreement.
I'm sure that once you understand how it works, you will also be able to use it, and maybe even love it.
The basic system is not mysterious. SVf_UTF8 is either on or it isn't (and if it's on, it had better be right :).
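If you can't vouch for the bytes yourself, one way to make sure the flag is right is to let Perl check them for you. A rough sketch, assuming Encode's strict 'UTF-8' decoder with the FB_CROAK check; decode_strict is a hypothetical helper name, not part of Encode:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper: returns the decoded character string,
# or undef if the octets are not well-formed UTF-8.
sub decode_strict {
    my ($octets) = @_;
    # decode() with a CHECK argument may modify its input, so use a copy.
    my $copy = $octets;
    return eval { Encode::decode('UTF-8', $copy, Encode::FB_CROAK) };
}

my $good = decode_strict("\xE2\x98\xBA");   # well-formed: WHITE SMILING FACE
my $bad  = decode_strict("\xE2\x28\xA1");   # 0x28 is not a continuation byte

print defined $good ? "good: decoded\n" : "good: rejected\n";
print defined $bad  ? "bad: decoded\n"  : "bad: rejected\n";
```

The result for valid input is the same flagged scalar you would get by flipping the flag by hand, but malformed input is refused instead of being silently marked as UTF-8.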
Perl 6 can use two different string types, Buf and Str.
If Str is limited to Unicode and only Unicode, that's Nirvana...