in reply to Re: Never touch (or look at) the UTF8 flag!!
in thread Interventionist Unicode Behaviors

Luck had nothing to do with it. :) It was three bytes in the string because I specified those exact bytes.

If this piece of code was in a module, and the main script disagreed about the encoding, things would have broken. For example, if "use encoding 'iso-8859-1';" was used: you'd get a UTF8 string of three characters, but six bytes.

That's why you should never use \x for literal bytes. Instead, use pack with a "C*" template. Only if nothing in your script uses the new stuff, you can be sure you get the old stuff.
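
As a minimal sketch of the pack approach (the variable names are illustrative and not from the original code), the three bytes of the smiley can be built from explicit octet values like this:

```perl
use strict;
use warnings;

# Build a three-byte string from explicit octet values.
# "C*" packs each argument as one unsigned char (octet).
my $bytes = pack("C*", 0xE2, 0x98, 0xBA);

# length() counts 3 here, since each element of the string is one octet.
my $len = length $bytes;

# unpack with the same template recovers the octet values.
my @octets = unpack("C*", $bytes);

printf "%d bytes: %s\n", $len, join(" ", map { sprintf "%02X", $_ } @octets);
```

The octets are given as numbers rather than as escapes in a string literal, so the result should not depend on how the source file's string literals are interpreted.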

I'm personally quite comfortable turning the UTF8 flag on and off, as I'm confident that I know when a sequence of bytes is valid UTF-8 and when it's not. I do a lot of XS programming, and I've written a fair amount of low-level Unicode processing code. See this patch of mine for Apache Lucene that fixes the IO classes so that they read/write legal UTF-8 (which they'd claimed they were using in their specs) rather than the Modified UTF-8 they'd actually been using.

For someone who's sufficiently skilled in Perl, unicode, and the combination of both, you managed to appear quite clueless in the OP. But now I wonder if you were actually serious (if so, please rephrase your question, this time based on the way you SHOULD use things), or just trolling.

Perl 6 can use two different string types, Buf and Str.
If Str is limited to Unicode and only Unicode, that's Nirvana...

It is. A Buf can have an :encoding attribute, but a Str is always Unicode. That is: Unicode, not UTF-8. Exactly as in Perl 5, you must not care about the internal encoding unless you're actually doing internal things. Perl does Unicode for its text strings, not UTF-8. That's why you have to decode() and encode() yourself.
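
A small sketch of that decode()/encode() round trip in Perl 5, using the Encode module. The variable names and the choice of the smiley's octets as input are assumptions made for illustration:

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Three octets as they might arrive from a file or socket
# (built with pack so the values are stated explicitly).
my $octets = pack("C*", 0xE2, 0x98, 0xBA);

# decode() turns UTF-8 octets into a text string of Unicode characters.
my $text      = decode("UTF-8", $octets);
my $chars     = length $text;   # counts characters, not bytes
my $codepoint = ord $text;      # code point of the first character

# encode() turns the text string back into UTF-8 octets for output.
my $out = encode("UTF-8", $text);

printf "%d character, U+%04X, %d octets after encode\n",
    $chars, $codepoint, length $out;
```

Nothing here inspects or sets the UTF8 flag. The code only states at the boundaries which strings are octets and which are text.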

Juerd # { site => 'juerd.nl', do_not_use => 'spamtrap', perl6_server => 'feather' }


Re^3: Never touch (or look at) the UTF8 flag!!
by creamygoodness (Curate) on Sep 11, 2006 at 19:48 UTC
    If this piece of code was in a module, and the main script disagreed about the encoding, things would have broken. For example, if "use encoding 'iso-8859-1';" was used: you'd get a UTF8 string of three characters, but six bytes.
    That's why you should never use \x for literal bytes. Instead, use pack with a "C*" template. Only if nothing in your script uses the new stuff, you can be sure you get the old stuff.
    This is really a side issue, because as I've stressed, the hex notation was a means to an end. All I wanted was a scalar with a particular sequence of bytes in the PV, and I'd have been just as happy to have gotten it with pack, as you advocate.

    Nevertheless, I have not yet found a way to make the interpolated backslash-x notation misbehave as you suggest it should. Can you please indicate how to modify this code sample so that it illustrates your assertion?

    slothbear:~/perltest marvin$ cat BackslashX.pm
    package BackslashX;
    use strict;
    use warnings;

    use Encode '_utf8_on';

    our $smiley = "\xE2\x98\xBA";
    _utf8_on($smiley);

    1;
    slothbear:~/perltest marvin$ cat backslash_x.plx
    #!/usr/bin/perl

    use strict;
    use warnings;

    use encoding 'iso-8859-1';
    use BackslashX;
    use Devel::Peek;

    Dump($BackslashX::smiley);
    print $BackslashX::smiley;
    print "\n";
    slothbear:~/perltest marvin$ perl backslash_x.plx
    SV = PV(0x1834224) at 0x181ed98
      REFCNT = 1
      FLAGS = (POK,pPOK,UTF8)
      PV = 0x372010 "\342\230\272"\0 [UTF8 "\x{263a}"]
      CUR = 3
      LEN = 4
    "\x{263a}" does not map to iso-8859-1 at backslash_x.plx line 11.
    \x{263a}

    That's clearly broken, but only because Unicode code point 0x263a doesn't map to Latin-1. How do I get the 6-byte combo?
    For someone who's sufficiently skilled in Perl, unicode, and the combination of both, you managed to appear quite clueless in the OP. But now I wonder if you were actually serious (if so, please rephrase your question, this time based on the way you SHOULD use things), or just trolling.

    Trolling? On the contrary: I'm doing my best to keep this discussion low-key despite some rather provocative remarks about my competence that have gone by. :)

    --
    Marvin Humphrey
    Rectangular Research ― http://www.rectangular.com

      That's clearly broken, but only because Unicode code point 0x263a doesn't map to Latin-1. How do I get the 6-byte combo?

      Interesting. That is indeed very much broken.

      Juerd # { site => 'juerd.nl', do_not_use => 'spamtrap', perl6_server => 'feather' }