in reply to Re: Never touch (or look at) the UTF8 flag!!
in thread Interventionist Unicode Behaviors
Luck had nothing to do with it. :) It was three bytes in the string because I specified those exact bytes.
If this piece of code were in a module, and the main script disagreed about the encoding, things would have broken. For example, if "use encoding 'iso-8859-1';" were in effect, you'd get a UTF8 string of three characters, but six bytes.
That's why you should never use \x for literal bytes. Instead, use pack with a "C*" template. Only if nothing in your script uses the new Unicode features can you be sure you get the old byte semantics.
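A minimal sketch of the pack approach, for illustration only. The three byte values here are an assumption (the UTF-8 encoding of U+20AC, the euro sign); substitute whatever octets you actually mean.

```perl
use strict;
use warnings;

# Build a byte string from explicit octet values instead of "\x.." escapes.
# The values are illustrative: E2 82 AC is the UTF-8 encoding of U+20AC.
my $bytes = pack("C*", 0xE2, 0x82, 0xAC);

# One string element per octet given to pack.
printf "%d bytes\n", length $bytes;

# Inspect the octets without ever looking at the UTF8 flag.
print join(" ", map { sprintf "%02X", $_ } unpack("C*", $bytes)), "\n";
```

The point is that the octets are given as numbers, so no source-encoding pragma gets a chance to reinterpret a string literal.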
I'm personally quite comfortable turning the UTF8 flag on and off, as I'm confident that I know when a sequence of bytes is valid UTF-8 and when it's not. I do a lot of XS programming, and I've written a fair amount of low-level Unicode processing code. See this patch of mine for Apache Lucene that fixes the IO classes so that they read/write legal UTF-8 (which they'd claimed they were using in their specs) rather than the Modified UTF-8 they'd actually been using.
For someone who's sufficiently skilled in Perl, Unicode, and the combination of both, you managed to appear quite clueless in the OP. But now I wonder if you were actually serious (if so, please rephrase your question, this time based on the way you SHOULD use things), or just trolling.
Perl 6 can use two different string types, Buf and Str. If Str is limited to Unicode and only Unicode, that's Nirvana...
It is. A Buf can have an :encoding attribute, but a Str is always Unicode. That is: Unicode, not UTF-8. Exactly as in Perl 5, you must not care about the internal encoding unless you're actually doing internal things. Perl does Unicode for its text strings, not UTF-8. That's why you have to decode() and encode() yourself.
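A rough sketch of that decode/encode round trip in Perl 5, using the core Encode module. The input octets are an assumed stand-in for data read from a file or socket, and the variable names are only illustrative.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Assumed input: three octets as they might arrive from a file or socket.
my $octets = pack("C*", 0xE2, 0x82, 0xAC);

# Bytes in: decode once at the boundary to get a text string.
my $text = decode("UTF-8", $octets);

# Work with characters here; how Perl stores them internally is not our business.
printf "U+%04X, %d character\n", ord($text), length $text;

# Text out: encode once at the boundary to get bytes again.
my $out = encode("UTF-8", $text);
printf "%d bytes\n", length $out;
```

Nothing in this sketch inspects or toggles the UTF8 flag; the only places that know about an encoding are the two boundary calls.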
Re^3: Never touch (or look at) the UTF8 flag!!
by creamygoodness (Curate) on Sep 11, 2006 at 19:48 UTC
by Juerd (Abbot) on Sep 11, 2006 at 22:11 UTC