You can always represent all of Unicode in an XML document using entities, but that is a separate issue from the encoding used by a particular XML document and whether and how it gets converted upon parsing. Your post sounds like you have a heap of flawed assumptions about encodings. (To be sure, most people do, I am not scolding you.) Please read The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) The topic is complex and much harder to consume than first appearances suggest. I have it down fairly solidly at this point (after a good bit of work), and I still occasionally embarrass myself.
Makeshifts last the longest.
<?xml version="1.0" encoding="ISO-8859-1"?>
<something>
&#1040;
</something>
Now, what is an XML parser supposed to do with this? My understanding is that it should always return the data as Unicode in some form; how else could it interpret that character reference? But so far I have been unable to find any official reference on this situation. Maybe I'm just being stupid, or I've looked in the wrong places, but I can't find any official backup for or against my intuition.
Yes. The recommended course is for parsers to always upgrade the input to Unicode for the internal representation of strings as they parse. It is then the XML generator's job to serialize back to whatever output encoding is requested, using entities where it encounters a character that the output encoding cannot represent.
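For illustration, here is roughly what that looks like in Perl. This is just a minimal sketch using XML::LibXML as an example parser; the document is the one from your post.

#!/usr/bin/perl
use strict;
use warnings;
use XML::LibXML;

# An ISO-8859-1 document containing a character reference
# that Latin-1 itself cannot encode.
my $xml = qq{<?xml version="1.0" encoding="ISO-8859-1"?>\n}
        . qq{<something>&#1040;</something>\n};

my $doc  = XML::LibXML->load_xml( string => $xml );
my $text = $doc->documentElement->textContent;

# The parser has already upgraded the text to Unicode internally.
printf "code point: U+%04X\n", ord $text;                        # U+0410, CYRILLIC CAPITAL LETTER A
printf "UTF-8 flag: %s\n", utf8::is_utf8($text) ? 'on' : 'off';  # 'on'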
(A corollary is that you cannot preserve entities exactly as they were in the input, nor should you want to. If you are forced to do so in order to satisfy some application downstream, then that application is broken. A major goal of XML is to make encoding completely transparent to processors.)
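Sketched with XML::LibXML again (the exact form of the reference is up to the serializer, so take the comment with a grain of salt): declare the output encoding as ISO-8859-1, hand the document a character Latin-1 cannot hold, and the serializer falls back to a character reference on output.

#!/usr/bin/perl
use strict;
use warnings;
use XML::LibXML;

# A document whose declared output encoding is ISO-8859-1 ...
my $doc  = XML::LibXML::Document->new( '1.0', 'ISO-8859-1' );
my $root = $doc->createElement('something');
$root->appendText("\x{0410}");    # ... containing CYRILLIC CAPITAL LETTER A
$doc->setDocumentElement($root);

# Latin-1 cannot represent U+0410, so the serializer emits a numeric
# character reference (e.g. &#x410;) in its place.
print $doc->toString;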
But so long as it does not encounter a character in the input stream that cannot be represented in the encoding used by the document, a parser might opt to skip the conversion for better performance. If the output document is meant to use the same encoding as the input document, this can save a lot of CPU time. The parser might also choose to upgrade only those strings which contain unrepresentable characters (which necessarily arrive as entities). This assumes there is a way to internally tag each string with the encoding it uses, so that the processor can take it into account. Perl can flag strings as UTF-8, but it has no way to distinguish the encodings of non-UTF-8 strings from one another.
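For instance (a throwaway sketch; the byte value is arbitrary), the flag only tells you whether Perl is treating a string as character data. For plain byte strings, you have to supply the encoding yourself:

#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(decode);

my $bytes = "\xC0";    # Latin-1? KOI8-R? The bytes alone cannot tell you.

print utf8::is_utf8($bytes)
    ? "flagged as UTF-8\n"
    : "plain bytes, encoding unknown\n";

# To get characters, we must name the encoding ourselves.
my $chars = decode( 'ISO-8859-1', $bytes );
printf "decoded to U+%04X\n", ord $chars;    # U+00C0, LATIN CAPITAL LETTER A WITH GRAVE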
If you're not getting slightly dizzy by now, congrats. :-)
Makeshifts last the longest.