Juerd,

RED FLAG. Here you manually switch on the internal UTF8 flag. You should NEVER do this, unless you know all the details of Perl's UTF8 handling. If the string happened to be stored as latin-1 before, you're lucky because this sequence of bytes happens to also be valid utf-8: you get one smiley. If the string happened to be stored as utf-8 before, nothing happens because the UTF8 flag was already set.

Luck had nothing to do with it. :) It was three bytes in the string because I specified those exact bytes. Then I turned on the UTF8 flag and got exactly what I wanted: a Perl scalar with a PV of 0xE2 0x98 0xBA 0x00, a LEN of 4, a CUR of 3, with the SVf_UTF8 flag ~~set~~ unset, yada yada. I chose not to represent the string as "\x{263a}" or "\N{WHITE SMILING FACE}" because in both of those cases the SVf_UTF8 flag would have been set -- whereas by using raw hex notation, I coerced Perl into parsing the string using byte semantics.

#!/usr/bin/perl
use strict;
use warnings;

use Devel::Peek;

my $bytes = "\xE2\x98\xBA";
my $uni   = "\x{263a}";

# Only one difference between these: the UTF8 flag is on for $uni
Dump($bytes);
Dump($uni); 

__END__
Outputs:

SV = PV(0x1801660) at 0x180b584
  REFCNT = 1
  FLAGS = (PADBUSY,PADMY,POK,pPOK)
  PV = 0x300bd0 "\342\230\272"\0
  CUR = 3
  LEN = 4
SV = PV(0x1801678) at 0x180b560
  REFCNT = 1
  FLAGS = (PADBUSY,PADMY,POK,pPOK,UTF8)
  PV = 0x316c80 "\342\230\272"\0 [UTF8 "\x{263a}"]
  CUR = 3
  LEN = 4
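
For readers who want to watch the flag flip itself rather than compare two separately built scalars, here is a minimal sketch. It assumes the core Encode module and the built-in utf8::is_utf8 function; the variable names are only illustrative. Note that Encode::_utf8_on does no validation at all: it trusts that the bytes are already well-formed UTF-8.

```perl
#!/usr/bin/perl
use strict;
use warnings;

use Encode ();

# Three raw bytes: the UTF-8 encoding of U+263A (WHITE SMILING FACE).
my $s = "\xE2\x98\xBA";

# With the flag off, Perl treats the buffer as three separate characters.
my $flag_before = utf8::is_utf8($s) ? 1 : 0;
my $len_before  = length $s;

# Flip the flag only. The buffer is neither validated nor modified.
Encode::_utf8_on($s);

# With the flag on, the same three bytes are read as one character.
my $flag_after = utf8::is_utf8($s) ? 1 : 0;
my $len_after  = length $s;
my $ord_after  = ord $s;

printf "before: flag=%d length=%d\n", $flag_before, $len_before;
printf "after:  flag=%d length=%d ord=U+%04X\n",
    $flag_after, $len_after, $ord_after;
```

The buffer is identical before and after; only Perl's interpretation of it changes, which is the same difference the two Dump() calls above show.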

I'm personally quite comfortable turning the UTF8 flag on and off, as I'm confident that I know when a sequence of bytes is valid UTF-8 and when it's not. I do a lot of XS programming, and I've written a fair amount of low-level Unicode processing code. See this patch of mine for Apache Lucene, which fixes the IO classes so that they read and write legal UTF-8 (which their specs claimed they were using) rather than the Modified UTF-8 they had actually been using.

However, switching SVf_UTF8 on and off is not something I do lightly, or that I would recommend to the casual user, so there we are in agreement.

I'm sure that once you understand how it works, you will also be able to use it, and maybe even love it.

The basic system is not mysterious. SVf_UTF8 is either on or it isn't. (And if it's on, it had better be right. :)
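
For anyone who would rather have Perl check that "it had better be right" part, here is a rough sketch of the checked route, using the built-in utf8::decode instead of flipping the flag by hand. The decode_if_valid helper is a hypothetical name made up for this illustration, not part of any module.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: return the decoded string if the bytes are
# well-formed UTF-8, or undef if they are not.
sub decode_if_valid {
    my ($bytes) = @_;
    my $copy = $bytes;    # utf8::decode works in place, so use a copy
    return utf8::decode($copy) ? $copy : undef;
}

my $good = decode_if_valid("\xE2\x98\xBA");    # complete sequence for U+263A
my $bad  = decode_if_valid("\xE2\x98");        # truncated: final byte missing

print defined $good ? "good: decoded\n" : "good: rejected\n";
print defined $bad  ? "bad: decoded\n"  : "bad: rejected\n";
```

Unlike switching SVf_UTF8 on directly, utf8::decode returns false for the truncated sequence, so a malformed buffer never ends up carrying the flag.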

Perl 6 can use two different string types, Buf and Str.

If Str is limited to Unicode and only Unicode, that's Nirvana...

--
Marvin Humphrey
Rectangular Research ― http://www.rectangular.com

In reply to Re: Never touch (or look at) the UTF8 flag!! by creamygoodness
in thread Interventionist Unicode Behaviors by creamygoodness
