So, as with everything in Perl, operating in a Microsoft context complicates things. I was also a little technically sloppy in my description for the sake of a simplified high-level picture, which I should know is just a recipe for confusion. So I apologize.

A read through of Unicode Support in perlguts as well as perluniintro, perlunitut, and perlunicode might be helpful for further clarifications.

If Perl always used UTF-8 for internal operation, things would be slow (as per the OP). So strings that are representable in the platform's native eight-bit character set are stored that way instead. Specifically (from perluniintro):

Internally, Perl currently uses either whatever the native eight-bit character set of the platform (for example Latin-1) is, defaulting to UTF-8, to encode Unicode strings. Specifically, if all code points in the string are 0xFF or less, Perl uses the native eight-bit character set. Otherwise, it uses UTF-8.

So until Perl encounters a reason, it will not flip the UTF8 flag you are seeing via Devel::Peek. In your scenario, your XML parser sees the UTF-8 encoding declaration at the top of the file, and so the flag gets set. Note that if you run

perl -MDevel::Peek -E "Dump chr 199"

you get something like

SV = PV(0x15ceba8) at 0x15ed468
  REFCNT = 1
  FLAGS = (PADTMP,POK,READONLY,pPOK)
  PV = 0x16359c0 "\307"\0
  CUR = 1
  LEN = 12

The UTF8 flag is not set, and the same character is being stored as a single byte (\307 in octal, i.e. 0xC7) according to the native eight-bit character set.
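If you'd rather not read Dump output, here is a minimal sketch of the same point using only the always-available utf8::is_utf8 and utf8::upgrade. The variable names are mine, purely for illustration. It compares a string Perl keeps in the eight-bit form, one that has to be UTF-8 internally, and an upgraded copy of the first:

```perl
use strict;
use warnings;

# chr 199 (U+00C7) fits in one byte, so Perl keeps the eight-bit form.
my $narrow = chr 199;

# A code point above 0xFF forces the internal UTF-8 representation.
my $wide = chr 0x100;

# Same logical string as $narrow, but stored internally as UTF-8.
my $upgraded = $narrow;
utf8::upgrade($upgraded);

for my $pair ( [ narrow => $narrow ], [ wide => $wide ], [ upgraded => $upgraded ] ) {
    my ( $name, $str ) = @$pair;
    printf "%-9s UTF8 flag=%d length=%d\n",
        $name, ( utf8::is_utf8($str) ? 1 : 0 ), length $str;
}

# The flag is a storage detail: both copies are the same one-character string.
print "narrow eq upgraded\n" if $narrow eq $upgraded;
```

All three report a length of 1; only the flag differs, and $narrow and $upgraded still compare equal. The flag tells you how the string is stored, not what it contains.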

The thing that seems to be missing from your thinking is serialization. When you feed a string through encode_utf8, you are saying: take this logical object and encode it for communication via a channel that expects UTF-8, much like you might have a channel that expects JSON or a channel that expects little-endian. The result is the encoded byte stream. As far as Perl is concerned it is just a string of bytes: logically, it contains no wide characters, though it may have a number of high-bit characters. You need to decode the stream in order for it to make sense logically. Now, if Perl is hooked up to a UTF-8 terminal, it'll look right when printed, and if it's hooked up to a Windows-1252 terminal, you'll get junk.
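A small sketch of that round trip, assuming the stock Encode module (the variable names are just for illustration). Since the thread is about length(), it prints the length at each stage:

```perl
use strict;
use warnings;
use Encode qw(encode_utf8 decode_utf8);

my $char  = chr 199;              # one logical character, U+00C7
my $bytes = encode_utf8($char);   # serialized form: the bytes 0xC3 0x87
my $back  = decode_utf8($bytes);  # decoded back into one logical character

# Show each byte of the encoded string in hex.
my $hex = join ' ', map { sprintf '%02X', ord } split //, $bytes;

printf "char:  length=%d\n",      length $char;
printf "bytes: length=%d (%s)\n", length $bytes, $hex;
printf "back:  length=%d, round trip %s\n",
    length $back, ( $back eq $char ? 'ok' : 'FAILED' );
```

The encoded string has length 2 because it is two bytes, neither of which is a wide character. Only after decode_utf8 does it go back to being the single character you started with.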

Hopefully this helps?

#11929 First ask yourself `How would I do this without a computer?' Then have the computer do it the same way.

In reply to Re^5: performance of length() in utf-8 by kennethk
in thread performance of length() in utf-8 by seki
