in reply to Re^4: Character Conversion Conundrum
in thread Character Conversion Conundrum
Yes. The recommended course is for parsers to always upgrade the input to Unicode for the internal representation of strings as they parse. It is then the XML generator's job to serialize back to whatever output encoding is requested, using entities where it encounters a character that the output encoding cannot represent.
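For illustration, a minimal sketch of that serialization step using the core Encode module's FB_XMLCREF fallback, which emits a numeric character reference for anything the target encoding can't hold (the iso-8859-1 target and the sample string are just assumptions for the example):

```perl
use strict;
use warnings;
use Encode ();

# A character string containing something Latin-1 cannot represent.
my $text = "caf\x{E9} \x{5BFF}\x{53F8}";   # "café" plus two CJK characters

# Serialize to Latin-1; any character the target encoding cannot
# represent is written out as a numeric character reference instead.
my $octets = Encode::encode('iso-8859-1', $text, Encode::FB_XMLCREF);

print $octets, "\n";   # café &#x5bff;&#x53f8;
```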
(A corollary is that you cannot preserve entities exactly as they were in the input, nor should you want to. If you are forced to in order to satisfy some downstream application, then that application is broken. A major goal of XML is to make encoding completely transparent to processors.)
But so long as it doesn't encounter a character in the input stream which cannot be represented in the encoding used by the document, a parser might opt to avoid conversion in order to achieve better performance. If the output document is intended to have the same encoding as the input document, this can save a lot of CPU time. The parser might also choose to upgrade only those strings which contain unrepresentable characters (which will appear as entities in the input, obviously). This assumes that there is a way to internally tag each string with the encoding it uses so that the processor can take this into account. Perl can flag strings as UTF-8, but has no way to distinguish the encodings of non-UTF-8 strings from one another.
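A minimal sketch of what that flag does and does not tell you (the byte value and the iso-8859-1 guess are purely illustrative):

```perl
use strict;
use warnings;
use Encode ();

my $bytes = "\xE9";    # just an octet; might be Latin-1 "é", might be anything
my $chars = Encode::decode('iso-8859-1', $bytes);   # upgraded to a character string

# The UTF-8 flag tells the two apart internally...
print utf8::is_utf8($bytes) ? "flagged\n" : "not flagged\n";   # not flagged
print utf8::is_utf8($chars) ? "flagged\n" : "not flagged\n";   # flagged

# ...but nothing in $bytes records which encoding its octets are in;
# a parser that delays upgrading has to track that separately.
```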
If you're not getting slightly dizzy by now, congrats. :-)
Makeshifts last the longest.