
Yes. The recommended course is for parsers to always upgrade the input to Unicode for the internal representation of strings as they parse. It is then the XML generator's job to serialize back to whatever output encoding is requested, using entities where it encounters a character that the output encoding cannot represent.
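To make the generator's half of that concrete, here is a minimal sketch in Python (chosen only for brevity; the thread itself is about Perl). The function name `serialize_text` is hypothetical, not part of any XML library. It assumes the parser has already produced a Unicode string, escapes markup characters with the standard library's `xml.sax.saxutils.escape`, and lets the codec's `xmlcharrefreplace` error handler emit a numeric character reference for anything the output encoding cannot hold.

```python
from xml.sax.saxutils import escape


def serialize_text(text: str, out_encoding: str) -> bytes:
    """Serialize already-decoded (Unicode) character data for XML output.

    Hypothetical helper for illustration only.
    """
    # Escape &, < and > first, so the references added in the next
    # step are not escaped a second time.
    escaped = escape(text)
    # Characters the target encoding cannot represent become numeric
    # character references (&#NNNN;) instead of raising an error.
    return escaped.encode(out_encoding, errors="xmlcharrefreplace")


if __name__ == "__main__":
    sample = "caf\u00e9 \u263a & co"
    print(serialize_text(sample, "iso-8859-1"))
    print(serialize_text(sample, "utf-8"))
```

With ISO-8859-1 as the target, the é survives as a single byte while U+263A comes out as a reference; with UTF-8 as the target, no references are needed at all.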

(A corollary is that you cannot preserve entities exactly as they were in the input, nor should you want to. If some downstream application forces you to, then that application is broken. A major goal of XML is to make encoding completely transparent to processors.)

But so long as the input stream contains no character that the document's own encoding cannot represent, a parser might opt to skip the conversion for better performance. If the output document is intended to have the same encoding as the input document, this can save a lot of CPU time. The parser might also choose to upgrade only those strings which contain unrepresentable characters (which can only arrive as entities, obviously). This assumes that there is a way to tag each string internally with the encoding it uses, so that the processor can take this into account. Perl can flag strings as UTF-8, but has no way to tell the encodings of non-UTF-8 strings apart.
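The tagging idea can be sketched roughly as follows, again in Python and purely as an illustration. The `TaggedString` class is hypothetical: it keeps the raw bytes together with the name of their encoding, passes them through untouched when the output encoding matches, and only upgrades to Unicode when it has to. It assumes the raw bytes are character data that is already escaped as it was in the input.

```python
import codecs
from dataclasses import dataclass


@dataclass
class TaggedString:
    """Raw bytes plus the encoding they are in (hypothetical type)."""
    raw: bytes
    encoding: str  # e.g. "iso-8859-1"

    def to_text(self) -> str:
        # The "upgrade" step: decode to a Unicode string.
        return self.raw.decode(self.encoding)

    def serialize(self, out_encoding: str) -> bytes:
        # Compare canonical codec names so aliases such as
        # "latin-1" and "iso-8859-1" count as the same encoding.
        same = (codecs.lookup(out_encoding).name
                == codecs.lookup(self.encoding).name)
        if same:
            # Fast path: no decode/encode round trip at all.
            return self.raw
        # Slow path: upgrade, then re-encode, using numeric character
        # references for anything the target cannot represent.
        return self.to_text().encode(out_encoding,
                                     errors="xmlcharrefreplace")


if __name__ == "__main__":
    s = TaggedString(b"caf\xe9", "iso-8859-1")
    print(s.serialize("latin-1"))
    print(s.serialize("utf-8"))
    print(s.serialize("ascii"))
```

The fast path is where the CPU savings would come from; everything else pays the full conversion cost, which is why the trick is only worthwhile when input and output encodings usually match.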

If you're not getting slightly dizzy by now, congrats. :-)

Makeshifts last the longest.

In reply to Re^5: Character Conversion Conundrum by Aristotle
in thread Character Conversion Conundrum by SheridanCat
