in reply to Re^4: Character Conversion Conundrum
in thread Character Conversion Conundrum

Yes. The recommended course is for parsers to always upgrade the input to Unicode for the internal representation of strings as they parse. It is then the XML generator's job to serialize back to whatever output encoding is requested, using entities where it encounters a character that the output encoding cannot represent.
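
To make that round trip concrete, here is a minimal sketch in Perl using the core Encode module. The sample input and the serialize() helper are made up for illustration; a real XML generator also has to escape markup characters and write the XML declaration, which this ignores. The only piece relied on from Encode is the FB_XMLCREF fallback, which replaces characters the target encoding cannot represent with numeric character references.

```perl
use strict;
use warnings;
use Encode ();

# Assumed sample input: the UTF-8 octets for "EURO SIGN, space, 5, space, cafe with e-acute"
my $input_octets = "\xE2\x82\xAC 5 caf\xC3\xA9";

# Parser side: always upgrade to Perl's internal character representation.
my $text = Encode::decode('UTF-8', $input_octets);

# Generator side (hypothetical helper): serialize to the requested encoding,
# falling back to a character reference for anything it cannot represent.
sub serialize {
    my ($chars, $encoding) = @_;
    my $copy = $chars;    # work on a copy so the caller's string is never touched
    return Encode::encode($encoding, $copy, Encode::FB_XMLCREF);
}

my $as_latin1 = serialize($text, 'iso-8859-1');  # euro sign becomes a reference
my $as_utf8   = serialize($text, 'UTF-8');       # everything is representable

print "$as_latin1\n";
```

In the ISO-8859-1 output the e-acute is a single octet and only the euro sign is turned into a reference; in the UTF-8 output nothing needs one. Whether the input spelled a character literally or as a reference is lost either way, which is the point.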

(A corollary is that you cannot preserve entities exactly as they were in the input, nor should you want to. If some downstream application forces you to, then that application is broken. A major goal of XML is to make encoding completely transparent to processors.)

But so long as it doesn't encounter a character in the input stream which cannot be represented in the encoding used by the document, a parser might opt to avoid conversion in order to achieve better performance. If the output document is intended to have the same encoding as the input document, this can save a lot of CPU time. The parser might also choose to upgrade only those strings that contain characters the document's encoding cannot represent (which can only appear in the input as entities, obviously). This assumes that there is a way to internally tag each string with the encoding it uses so that the processor can take this into account. Perl can flag strings as UTF-8, but has no way to tell one non-UTF-8 encoding from another.
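
A rough sketch of what such tagging could look like in Perl. The tag_string() and upgrade() helpers are hypothetical, not something an existing parser is known to do: each string is kept as raw octets next to the name of its encoding and is only decoded on demand. The example also shows why the tag is needed, since the same octet means different things in two legacy encodings and Perl's UTF-8 flag says nothing about which one applies.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical wrapper: raw octets plus the name of the encoding they are in.
sub tag_string {
    my ($octets, $encoding) = @_;
    return { octets => $octets, encoding => $encoding };
}

# Decode to Perl characters only when a caller actually needs them.
sub upgrade {
    my ($tagged) = @_;
    return Encode::decode($tagged->{encoding}, $tagged->{octets});
}

# The same single octet, tagged with two different encodings.
my $latin1 = tag_string("\xA4", 'iso-8859-1');    # CURRENCY SIGN there
my $latin9 = tag_string("\xA4", 'iso-8859-15');   # EURO SIGN there

# Without the tag the two are indistinguishable: identical octets, no UTF-8 flag.
my $same_octets = $latin1->{octets} eq $latin9->{octets};
my $flag_set    = utf8::is_utf8($latin1->{octets}) ? 1 : 0;

# With the tag, upgrading gives two different code points.
my $cp_latin1 = ord upgrade($latin1);
my $cp_latin9 = ord upgrade($latin9);

printf "U+%04X vs U+%04X\n", $cp_latin1, $cp_latin9;
```

The cost is that every consumer of these strings has to check the tag before comparing or concatenating them, which is a large part of why always upgrading is the simpler recommendation.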

If you're not getting slightly dizzy by now, congrats. :-)

Makeshifts last the longest.

Re^6: Character Conversion Conundrum
by Joost (Canon) on Dec 22, 2004 at 23:28 UTC
    Ok, so the parser implementation is supposed to deal with possible Unicode characters/codepoints showing up in the resulting text (and should probably document how it deals with them). That makes sense, I guess.

    If you're not getting slightly dizzy by now, congrats. :-)

    I'm not dizzy, but I've been dealing with strangely encoded "xml" documents for some time now, so I've thought hard about it already, and I got plenty dizzy then. :-)

    Any "official" documentation on this XML parser behaviour would still be appreciated, though - I could use it to slap some unnamed third parties with :-)

      Hmm. I wish I could point you somewhere concrete. This is stuff I gleaned from the xml-dev and xsl-list mailing lists, posted by people such as Tim Bray and Michael Kay. I assume the folks who wrote the specs know what they say. :-) Unfortunately it means I don't have a reference on hand. Maybe I should look it up when I get the chance, or at least find relevant archive links from the lists.

      Makeshifts last the longest.