in reply to Re: Re: 8-bit Clean XML Data I/O?
in thread 8-bit Clean XML Data I/O?

The characters are going to get displayed at some point. If the wrong encoding is used, they are going to be displayed as junk. The Hebrew is not going to look right when displayed as Russian.
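To make the "Hebrew displayed as Russian" point concrete, here is a minimal Python sketch. The sample word and the choice of ISO-8859-8 (Hebrew) and ISO-8859-5 (Cyrillic) are my own assumptions for illustration; any pair of mismatched 8-bit encodings behaves the same way.

```python
# Illustration only: the sample text and both encodings are assumptions.
text = "\u05e9\u05dc\u05d5\u05dd"      # the Hebrew word "shalom"
raw = text.encode("iso-8859-8")        # what an 8-bit clean app would store

# A reader who assumes the Cyrillic encoding gets no error, just wrong letters.
junk = raw.decode("iso-8859-5")

print(repr(raw))
print(junk)
```

Nothing fails here. The bytes are perfectly valid in both encodings, which is why the mistake only shows up when a human looks at the output.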

Now, most systems deal with this by context. Everyone uses the same encoding for input and output and it all works. Until someone uses a different locale. Or they cut-and-paste from an app that doesn't declare the encoding. Or they send the file/email/database to someone else.

Also, XML is logically defined as using Unicode characters. A file without an encoding declaration is assumed to be UTF-8 or UTF-16; any other encoding must be declared. Many parsers convert from the declared encoding to Unicode strings and deal only with Unicode internally.
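As a rough sketch of what "convert from the declared encoding to Unicode" looks like in practice, here is Python's standard xml.etree.ElementTree doing it. The element name, the sample text, and the use of KOI8-R are assumptions made up for the example.

```python
import xml.etree.ElementTree as ET

# Illustration only: a tiny document stored as KOI8-R bytes, with the
# encoding named in the XML declaration.
text = "\u043f\u0440\u0438\u0432\u0435\u0442"   # Russian "privet"
doc = (
    '<?xml version="1.0" encoding="koi8-r"?><msg>' + text + "</msg>"
).encode("koi8-r")

# The parser reads the declaration and hands back a Unicode string.
root = ET.fromstring(doc)
parsed = root.text
print(parsed == text)
```

The application never sees the KOI8-R bytes after parsing, so an 8-bit clean design has to decide what happens on the way back out.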

Your choices are to: a) figure out what encoding is being used and mark the XML with that; b) generate invalid XML by not marking the encoding and using 8-bit bytes instead of UTF-8; c) find a safe encoding and transform the Unicode back into binary bytes; d) transcode to UTF-8 and use that everywhere. Options a and d are the best solutions, and both are standard.
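Options a and d can be sketched roughly as follows in Python. The function names xml_declared and xml_utf8, the msg element, and the sample input are hypothetical; both functions assume the caller already knows the source encoding, which is exactly the hard part.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

def xml_declared(raw: bytes, encoding: str) -> bytes:
    """Option a (sketch): keep the original encoding and declare it."""
    text = raw.decode(encoding)  # raises UnicodeDecodeError on a bad guess
    doc = f'<?xml version="1.0" encoding="{encoding}"?><msg>{escape(text)}</msg>'
    return doc.encode(encoding)

def xml_utf8(raw: bytes, source_encoding: str) -> bytes:
    """Option d (sketch): transcode to UTF-8 and use that everywhere."""
    text = raw.decode(source_encoding)
    doc = f'<?xml version="1.0" encoding="UTF-8"?><msg>{escape(text)}</msg>'
    return doc.encode("utf-8")

# Hypothetical input: Hebrew text plus an ampersand, stored as ISO-8859-8.
original = "\u05e9\u05dc\u05d5\u05dd & more"
raw = original.encode("iso-8859-8")

# Either way, a parser recovers the same Unicode text.
from_a = ET.fromstring(xml_declared(raw, "iso-8859-8")).text
from_d = ET.fromstring(xml_utf8(raw, "iso-8859-8")).text
print(from_a == from_d == original)
```

The two outputs differ only in the bytes on the wire. Option d has the advantage that every downstream consumer can ignore the declaration and still get it right.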

Re: Re: Re: Re: 8-bit Clean XML Data I/O?
by samtregar (Abbot) on Feb 22, 2004 at 21:42 UTC
    The characters are going to get displayed at some point. If the wrong encoding is used, they are going to be displayed as junk. The Hebrew is not going to look right when displayed as Russian.

    Sure, that's true. Of course, the same thing happens all the time with Unicode applications that guess the wrong character set and botch the conversion to and from UTF-8!

    Watching a Unicode app puke all over my data is what convinced me to make the next version 8-bit clean. I'll happily let the end-user worry about choosing the right character set and setting the right headers on their output. I'll even show them how to extend the app to verify that their data is in the right character set for what they're doing. But I'll be damned if I'm going to pretend I can know the character set of any given input in the general case.

    Like all trade-offs, this one will take time to prove itself. So far the comparison has been a good one, but we'll see!

    -sam