in reply to Never touch (or look at) the UTF8 flag!!
in thread Interventionist Unicode Behaviors

Juerd,

RED FLAG. Here you manually switch on the internal UTF8 flag. You should NEVER do this, unless you know all the details of Perl's UTF8 handling. If the string happened to be stored as latin-1 before, you're lucky because this sequence of bytes happens to also be valid utf-8: you get one smiley. If the string happened to be stored as utf-8 before, nothing happens because the UTF8 flag was already set.

Luck had nothing to do with it. :) It was three bytes in the string because I specified those exact bytes. Then I turned on the UTF8 flag and got exactly what I wanted: a Perl scalar with a PV of 0xE2 0x98 0xBA 0x00, a LEN of 4, a CUR of 3, with the SVf_UTF8 flag set, yada yada. I chose not to represent the string as "\x{263a}" or "\N{WHITE SMILING FACE}" because in both of those cases the SVf_UTF8 flag would have been set -- whereas by using raw hex notation, I coerced Perl into parsing the string using byte semantics.

#!/usr/bin/perl
use strict;
use warnings;
use Devel::Peek;

my $bytes = "\xE2\x98\xBA";
my $uni   = "\x{263a}";

# Only one difference between these: the UTF8 flag is on for $uni
Dump($bytes);
Dump($uni);

__END__

Outputs:

SV = PV(0x1801660) at 0x180b584
  REFCNT = 1
  FLAGS = (PADBUSY,PADMY,POK,pPOK)
  PV = 0x300bd0 "\342\230\272"\0
  CUR = 3
  LEN = 4
SV = PV(0x1801678) at 0x180b560
  REFCNT = 1
  FLAGS = (PADBUSY,PADMY,POK,pPOK,UTF8)
  PV = 0x316c80 "\342\230\272"\0 [UTF8 "\x{263a}"]
  CUR = 3
  LEN = 4

I'm personally quite comfortable turning the UTF8 flag on and off, as I'm confident that I know when a sequence of bytes is valid UTF-8 and when it's not. I do a lot of XS programming, and I've written a fair amount of low-level unicode processing code. See this patch of mine for Apache Lucene that fixes the IO classes so that they read/write legal UTF-8 (which they'd claimed they were using in their specs) rather than the Modified UTF-8 they'd actually been using.

However switching SVf_UTF8 on and off is not something I do lightly, or that I would recommend to the casual user, so there we are in agreement.

I'm sure that once you understand how it works, you will also be able to use it, and maybe even love it.

The basic system is not mysterious. SVf_UTF8 is either on or it isn't. (and if it's on, it better be right :).
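
A minimal sketch of why the flag "better be right", assuming a deliberately malformed two-byte input (the value of `$bad` is a made-up example, not something from this thread). `_utf8_on` only sets the flag, whereas `Encode::decode` with `FB_CROAK` and `utf8::decode` check the bytes first:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw( decode _utf8_on FB_CROAK );

# Hypothetical input: 0xC3 starts a two-byte sequence, but 0x28 is not
# a continuation byte, so this is not well-formed UTF-8.
my $bad = "\xC3\x28";

# _utf8_on() performs no validation; it just sets SVf_UTF8.
my $forced = $bad;
_utf8_on($forced);
my $flag_set = utf8::is_utf8($forced) ? 1 : 0;

# decode() with FB_CROAK validates and dies on malformed input.
# decode() may modify its input when a CHECK value is given, so pass a copy.
my $copy = $bad;
my $decoded_ok = eval { decode( 'UTF-8', $copy, FB_CROAK ); 1 } ? 1 : 0;

# utf8::decode() also validates: it returns false when the bytes are
# not well-formed.
my $in_place    = $bad;
my $in_place_ok = utf8::decode($in_place) ? 1 : 0;

print "flag forced on:         $flag_set\n";
print "decode succeeded:       $decoded_ok\n";
print "utf8::decode succeeded: $in_place_ok\n";
```

The forced scalar now claims to be utf8 while holding bytes that are not, which is the state the validating routes refuse to create.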

Perl 6 can use two different string types, Buf and Str.
If Str is limited to Unicode and only Unicode, that's Nirvana...
--
Marvin Humphrey
Rectangular Research ― http://www.rectangular.com

Re^2: Never touch (or look at) the UTF8 flag!!
by demerphq (Chancellor) on Sep 08, 2006 at 12:51 UTC

    # Only one difference between these: the UTF8 flag is on for $uni

    As far as I understand it, this need not be true. You mention that you do a lot of XS programming. It seems to me that you are equating the XS concept of altering the PV with the Perl concept of assigning to a string. With byte semantics that is correct, but with utf8 semantics it is not. I don't believe there is any guarantee that this will always be true. For instance, perl 5.12 could be entirely Unicode internally, and your program would break. Likewise, if the string were utf8-on before you did the assignment, the result would be different. If you want to operate at the level you seem to, I think you should use pack.

    Also a little nit: perl doesn't do UTF-8, it does utf8, which is subtly different from true UTF-8. Although in the context here I don't think it matters.
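
    One place where the utf8 versus UTF-8 distinction becomes visible from Perl code is the Encode module, which treats the two spellings as different encodings. The following is a rough sketch of that, not a description of the internals; the code point 0x110000 is simply an arbitrary value beyond the Unicode range, picked for illustration:

    ```perl
    #!/usr/bin/perl
    use strict;
    use warnings;
    use Encode qw( find_encoding encode );

    # Encode resolves the two spellings to different encoding objects.
    my $strict_name = find_encoding('UTF-8')->name;    # 'utf-8-strict'
    my $lax_name    = find_encoding('utf8')->name;     # 'utf8'

    # An arbitrary code point above U+10FFFF, which is outside Unicode.
    my $beyond = chr(0x110000);

    # The lax encoding writes it out using the same bit pattern rules;
    # the strict encoding does not accept it as-is.
    my $lax_bytes    = encode( 'utf8',  $beyond );
    my $strict_bytes = encode( 'UTF-8', $beyond );
    my $differs      = $lax_bytes ne $strict_bytes ? 1 : 0;

    print "UTF-8 resolves to: $strict_name\n";
    print "utf8 resolves to:  $lax_name\n";
    print "outputs differ:    $differs\n";
    ```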

    ---
    $world=~s/war/peace/g

      Can Perl 5.12 start interpreting string literals with utf8 semantics without shattering backwards compatibility? I didn't think that was likely to happen. If it does it won't just be my examples here that break! :)

      I tried to manufacture an example to demo your assertion that if the string were utf8-on prior to assignment the result would differ, but I haven't been able to come up with anything. Certainly the results would differ when concatenating onto the end of an existing string, but wholesale assignment? How would you change the following example to illustrate your point?

      #!/usr/bin/perl
      use strict;
      use warnings;
      use Encode qw( _utf8_on );
      use Devel::Peek qw( Dump );

      my $foo;
      $foo = 'foo';
      Dump($foo);

      my $bar;
      _utf8_on($bar);
      $bar = 'bar';
      Dump($bar);

      __END__

      Outputs:

      SV = PV(0x1801660) at 0x180b59c
        REFCNT = 1
        FLAGS = (PADBUSY,PADMY,POK,pPOK)
        PV = 0x365f20 "foo"\0
        CUR = 3
        LEN = 4
      SV = PV(0x1801678) at 0x180b56c
        REFCNT = 1
        FLAGS = (PADBUSY,PADMY,POK,pPOK)
        PV = 0x300bd0 "bar"\0
        CUR = 3
        LEN = 4

      I also have to confess that I don't understand what you mean when you say that altering the PV is not the same as assigning to a string under utf8 semantics. Certainly you'd need to alter the PV (and other values, possibly flags too) following the rules of unicode handling -- for instance, checking for illegal sequences at splice points. Is that what you mean? I've spelunked a lot of the utf8 code in sv.c and I don't recall having seen any constructs I didn't recognize.

      And thanks for the suggestion on pack. Maybe if I'd used that in the OP there'd be less blood on the floor! :)

      --
      Marvin Humphrey
      Rectangular Research ― http://www.rectangular.com
Re^2: Never touch (or look at) the UTF8 flag!!
by Juerd (Abbot) on Sep 11, 2006 at 09:32 UTC

    Luck had nothing to do with it. :) It was three bytes in the string because I specified those exact bytes.

    If this piece of code was in a module, and the main script disagreed about the encoding, things would have broken. For example, if "use encoding 'iso-8859-1';" was used: you'd get a UTF8 string of three characters, but six bytes.

    That's why you should never use \x for literal bytes. Instead, use pack with a "C*" template. Only if nothing in your script uses the new stuff, you can be sure you get the old stuff.
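
    A minimal sketch of the pack approach, together with one way to see "three characters, but six bytes". The call to `utf8::upgrade` is only a stand-in for a string that "happened to be stored as utf-8 before"; it is an assumption for illustration and does not reproduce what `use encoding` does:

    ```perl
    #!/usr/bin/perl
    use strict;
    use warnings;
    use bytes ();    # load bytes::length() without enabling byte semantics

    # Build the three octets without relying on how "\x" literals are parsed.
    my $octets = pack( 'C*', 0xE2, 0x98, 0xBA );

    my $octet_chars = length $octets;                      # 3 characters
    my $octet_flag  = utf8::is_utf8($octets) ? 1 : 0;      # flag is off

    # Same three characters, but stored internally as utf8: each of these
    # characters is above 0x7F and therefore takes two bytes.
    my $upgraded = $octets;
    utf8::upgrade($upgraded);

    my $up_chars = length $upgraded;                       # still 3 characters
    my $up_bytes = bytes::length($upgraded);               # 6 bytes of storage
    my $same     = $octets eq $upgraded ? 1 : 0;           # equal as strings

    print "packed:   $octet_chars chars, UTF8 flag $octet_flag\n";
    print "upgraded: $up_chars chars, $up_bytes bytes, equal to packed: $same\n";
    ```

    On the upgraded copy the flag is already set, so calling `_utf8_on` would change nothing and the string would stay three characters long rather than becoming one smiley.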

    I'm personally quite comfortable turning the UTF8 flag on and off, as I'm confident that I know when a sequence of bytes is valid UTF-8 and when it's not. I do a lot of XS programming, and I've written a fair amount of low-level unicode processing code. See this patch of mine for Apache Lucene that fixes the IO classes so that they read/write legal UTF-8 (which they'd claimed they were using in their specs) rather than the Modified UTF-8 they'd actually been using.

    For someone who's sufficiently skilled in Perl, unicode, and the combination of both, you managed to appear quite clueless in the OP. But now I wonder if you were actually serious (if so, please rephrase your question, this time based on the way you SHOULD use things), or just trolling.

    Perl 6 can use two different string types, Buf and Str.
    If Str is limited to Unicode and only Unicode, that's Nirvana...

    It is. A Buf can have an :encoding attribute, but a Str is always unicode. That is: unicode, not utf-8: exactly as in Perl 5, you must not care about the internal encoding unless you're actually doing internal things. Perl does unicode for its text strings, not utf-8. That's why you have to decode() and encode() yourself.
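
    A short sketch of the decode()/encode() round trip in Perl 5, assuming the three octets arrived from outside the program (here they are just built with pack for the sake of the example):

    ```perl
    #!/usr/bin/perl
    use strict;
    use warnings;
    use Encode qw( decode encode );

    # Pretend these octets were read from a file or socket.
    my $octets_in = pack( 'C*', 0xE2, 0x98, 0xBA );

    # Input boundary: octets become a text string of Unicode characters.
    my $text       = decode( 'UTF-8', $octets_in );
    my $char_count = length $text;
    my $code_point = sprintf 'U+%04X', ord $text;

    # Output boundary: the text string becomes octets again.
    my $octets_out = encode( 'UTF-8', $text );
    my $octet_count = length $octets_out;
    my $round_trip  = $octets_out eq $octets_in ? 1 : 0;

    print "decoded: $char_count character ($code_point)\n";
    print "encoded: $octet_count octets, round trip ok: $round_trip\n";
    ```

    Nothing here inspects or sets the UTF8 flag; the program only states which encoding applies at each boundary.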

    Juerd # { site => 'juerd.nl', do_not_use => 'spamtrap', perl6_server => 'feather' }

      If this piece of code was in a module, and the main script disagreed about the encoding, things would have broken. For example, if "use encoding 'iso-8859-1';" was used: you'd get a UTF8 string of three characters, but six bytes.

      That's why you should never use \x for literal bytes. Instead, use pack with a "C*" template. Only if nothing in your script uses the new stuff, you can be sure you get the old stuff.

      This is really a side issue, because as I've stressed, the hex notation was a means to an end. All I wanted was a scalar with a particular sequence of bytes in the PV, and I'd have been just as happy to have gotten it with pack, as you advocate.

      Nevertheless, I have not yet found a way to make the interpolated backslash-x notation misbehave as you suggest it should. Can you please indicate how to modify this code sample so that it illustrates your assertion?

      slothbear:~/perltest marvin$ cat BackslashX.pm
      package BackslashX;
      use strict;
      use warnings;

      use Encode '_utf8_on';

      our $smiley = "\xE2\x98\xBA";
      _utf8_on($smiley);

      1;

      slothbear:~/perltest marvin$ cat backslash_x.plx
      #!/usr/bin/perl
      use strict;
      use warnings;

      use encoding 'iso-8859-1';
      use BackslashX;
      use Devel::Peek;

      Dump($BackslashX::smiley);

      print $BackslashX::smiley;
      print "\n";

      slothbear:~/perltest marvin$ perl backslash_x.plx
      SV = PV(0x1834224) at 0x181ed98
        REFCNT = 1
        FLAGS = (POK,pPOK,UTF8)
        PV = 0x372010 "\342\230\272"\0 [UTF8 "\x{263a}"]
        CUR = 3
        LEN = 4
      "\x{263a}" does not map to iso-8859-1 at backslash_x.plx line 11.
      \x{263a}

      That's clearly broken, but only because Unicode code point 0x263a doesn't map to Latin-1. How do I get the 6-byte combo?

      For someone who's sufficiently skilled in Perl, unicode, and the combination of both, you managed to appear quite clueless in the OP. But now I wonder if you were actually serious (if so, please rephrase your question, this time based on the way you SHOULD use things), or just trolling.

      Trolling? On the contrary: I'm doing my best to keep this discussion low-key despite some rather provocative remarks about my competence that have gone by. :)

      --
      Marvin Humphrey
      Rectangular Research ― http://www.rectangular.com

        That's clearly broken, but only because Unicode code point 0x263a doesn't map to Latin-1. How do I get the 6-byte combo?

        Interesting. That is indeed very much broken.

        Juerd # { site => 'juerd.nl', do_not_use => 'spamtrap', perl6_server => 'feather' }