in reply to Re: 8-bit Clean XML Data I/O?
in thread 8-bit Clean XML Data I/O?

If you don't know what encoding the character data is in, it isn't very useful.

I've heard this before, and it has never struck me as persuasive. The rest of this system is very useful, and it doesn't need to know the character set of the data. In fact, there have been far fewer character-set-related bugs in this system than in a previous "100% Unicode" system that performed a similar function.
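
For what it's worth, "8-bit clean" here just means the bytes pass straight through. A minimal sketch of that idea, assuming the data only needs to be read and written (the file names are made up):

    use strict;
    use warnings;

    # Raw byte streams -- no encoding layer, so nothing gets
    # reinterpreted or "fixed" on the way through.
    open my $in,  '<:raw', 'input.dat'  or die "input.dat: $!";
    open my $out, '>:raw', 'output.dat' or die "output.dat: $!";

    while (read $in, my $buf, 8192) {
        print {$out} $buf;    # whatever bytes came in go out unchanged
    }
    close $out or die "output.dat: $!";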

You may be able to puzzle out the encoding by looking at the characters.

Oh, I've been there before and basically found it to be a giant waste of time. Even when guessing mostly works, it's rarely 100% reliable. Losing data, even "junk" data that isn't valid in any character set, is not an option in this application.
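
To make the "puzzle it out" problem concrete, here's a rough sketch with Encode::Guess -- the suspect list and the sample bytes are just illustrations, not from the real app:

    use strict;
    use warnings;
    use Encode::Guess;

    # Three bytes that are perfectly legal in every suspect encoding,
    # so there is nothing in the bytes themselves to puzzle out.
    my $octets = "\xE4\xE5\xE6";

    my $guess = guess_encoding($octets, qw(iso-8859-1 iso-8859-8 koi8-r));
    if (ref $guess) {
        print "guessed ", $guess->name, "\n";
        my $text = $guess->decode($octets);   # what we'd do on a hit
    }
    else {
        # On a miss you just get an error string back, typically saying
        # that several encodings match equally well.
        warn "could not guess: $guess\n";
    }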

-sam

Re: Re: Re: 8-bit Clean XML Data I/O?
by iburrell (Chaplain) on Feb 22, 2004 at 20:41 UTC
    The characters are going to get displayed at some point. If the wrong encoding is used, they are going to be displayed as junk. The Hebrew is not going to look right when displayed as Russian.

    Now, most systems deal with this by context. Everyone uses the same encoding for input and output and it all works. Until someone uses a different locale. Or they cut-and-paste from an app that doesn't declare the encoding. Or they send the file/email/database to someone else.

    Also, XML is logically defined in terms of Unicode characters. A file is either in one of the default encodings, UTF-8 or UTF-16, or it must declare its encoding. Many parsers convert from the declared encoding to Unicode strings and deal only in Unicode internally.
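
    For example (a sketch with XML::LibXML; the sample document is made up, but any conforming parser behaves the same way):

        use strict;
        use warnings;
        use XML::LibXML;

        # Latin-1 bytes on the wire, but the declaration says so, and the
        # parser hands back real Unicode characters.
        my $xml = qq{<?xml version="1.0" encoding="ISO-8859-1"?>\n}
                . qq{<note>caf\xE9</note>};

        my $doc  = XML::LibXML->new->parse_string($xml);
        my $text = $doc->documentElement->textContent;
        printf "%d characters\n", length $text;   # 4 -- already decoded for us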

    Your choices are to: a) figure out what encoding is being used and declare it in the XML; b) generate invalid XML by leaving the encoding undeclared and emitting raw 8-bit bytes instead of UTF-8; c) pick a safe encoding and transform the Unicode back into binary bytes; d) transcode everything to UTF-8 and use that everywhere. (a) and (d) are the standard solutions and the best ones.
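
    A bare-bones sketch of (d) using Encode -- the source encoding and sample bytes are made up; the point is that a known encoding goes in one end and UTF-8 comes out the other:

        use strict;
        use warnings;
        use Encode qw(decode encode);

        my $source_encoding = 'iso-8859-8';           # told to us, not guessed
        my $octets          = "\xF9\xEC\xE5\xED";     # Hebrew "shalom" in ISO-8859-8

        # Croak rather than silently mangle anything that isn't legal in
        # the claimed encoding.
        my $chars = decode($source_encoding, $octets, Encode::FB_CROAK);
        my $utf8  = encode('UTF-8', $chars);

        print qq{<?xml version="1.0" encoding="UTF-8"?>\n<note>$utf8</note>\n};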

      The characters are going to get displayed at some point. If the wrong encoding is used, they are going to be displayed as junk. The Hebrew is not going to look right when displayed as Russian.

      Sure, that's true. Of course, the same thing happens all the time with Unicode applications that guess the wrong character set and botch the conversion to and from UTF-8!

      Watching a Unicode app puke all over my data is what convinced me to make the next version 8-bit clean. I'll happily let the end-user worry about choosing the right character set and setting the right headers on their output. I'll even show them how to extend the app to verify that their data is in the right character set for what they're doing. But I'll be damned if I'm going to pretend I can know the character set of any given input in the general case.
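
      The verification hook is nothing fancy -- roughly this shape (the name and interface here are hypothetical; Encode does the actual checking):

          use strict;
          use warnings;
          use Encode ();

          # True if the bytes are at least legal in the charset the user
          # claims they're in; it proves nothing more than that.
          sub bytes_are_valid_in {
              my ($claimed_charset, $octets) = @_;
              return eval {
                  Encode::decode($claimed_charset, $octets, Encode::FB_CROAK);
                  1;
              } ? 1 : 0;
          }

          warn "not valid UTF-8\n"
              unless bytes_are_valid_in('UTF-8', "caf\xE9");  # stray Latin-1 byte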

      Like all trade-offs, this one will take time to prove itself. So far the comparison with the old Unicode system has favored this approach, but we'll see!

      -sam