in reply to 8-bit Clean XML Data I/O?

If you don't know what encoding the character data is in, it isn't very useful. You might as well strip it out completely, because until you figure out the encoding it is just junk. You may be able to puzzle out the encoding by looking at the characters. For European languages it is probably ISO-8859-1, possibly with Windows CP1252 characters in it.
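To make "puzzling out the encoding" concrete, here is a rough sketch, written in Python rather than Perl purely for illustration. The function name `guess_encoding` is hypothetical, and the heuristic only separates ASCII, UTF-8, CP1252 and ISO-8859-1. It relies on the fact that bytes 0x80-0x9F are control characters in ISO-8859-1 but mostly printable characters in CP1252.

```python
def guess_encoding(data: bytes) -> str:
    """Very rough guess at the encoding of a byte string (hypothetical helper)."""
    # Pure 7-bit data is the same in every encoding considered here.
    if all(b < 0x80 for b in data):
        return "ascii"
    # Random 8-bit text is rarely valid UTF-8, so a clean decode is a strong hint.
    try:
        data.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        pass
    # 0x80-0x9F: C1 controls in ISO-8859-1, curly quotes and similar in CP1252.
    if any(0x80 <= b <= 0x9F for b in data):
        return "cp1252"
    return "iso-8859-1"


print(guess_encoding(b"caf\xe9"))          # Latin-1 e-acute
print(guess_encoding(b"\x93quoted\x94"))   # CP1252 curly quotes
```

A guess like this cannot tell ISO-8859-1 apart from other 8-bit encodings such as the Hebrew or Cyrillic ones, so it is only a starting point.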

Many of the 8-bit encodings can be translated to Unicode and back again without losing any information. You will need to choose an encoding that works well for this; Latin1 or CP1252 are reasonable choices. There are two ways to handle this in XML. The best way is probably to write the XML in your encoding and tag it.
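The round trip can be checked directly. The sketch below is in Python rather than Perl, only to illustrate the idea (Perl's Encode module provides comparable `decode` and `encode` functions). The helper name `undefined_bytes` is hypothetical. It shows that every byte value survives a trip through ISO-8859-1, and lists the byte values that Python's own CP1252 codec refuses to decode; other implementations may map those bytes differently.

```python
# Every possible byte value, 0x00 through 0xFF.
every_byte = bytes(range(256))

# ISO-8859-1 maps each byte to the Unicode code point with the same number,
# so decoding and re-encoding gives back exactly the original bytes.
as_text = every_byte.decode("iso-8859-1")
round_tripped = as_text.encode("iso-8859-1")
print(round_tripped == every_byte)


def undefined_bytes(encoding: str) -> list:
    """Return the byte values that the given codec cannot decode (hypothetical helper)."""
    bad = []
    for value in range(256):
        try:
            bytes([value]).decode(encoding)
        except UnicodeDecodeError:
            bad.append(value)
    return bad


# In Python's codec a handful of CP1252 byte values have no character assigned.
print([hex(v) for v in undefined_bytes("cp1252")])
```

If arbitrary bytes have to survive, that makes Latin1 the safer of the two choices, at least with a strict codec like this one.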

XML::Writer looks like it doesn't do any translation. You will need to write the declaration for the chosen encoding, make sure the file is in binary mode, and write the strings. The parsers that XML::Simple uses support encodings on the file, but they need to know the encoding because they translate everything into Unicode character data. That data is stored in Perl as UTF-8 and will need to be translated back to your "safe" encoding after reading.
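The write-then-read cycle described above might look roughly like this. It is a sketch in Python using the standard `xml.etree.ElementTree` module, not XML::Writer or XML::Simple, and the function names, the `data` element and the choice of ISO-8859-1 as the "safe" encoding are all assumptions made for the example.

```python
import os
import tempfile
import xml.etree.ElementTree as ET

SAFE_ENCODING = "iso-8859-1"  # assumed "safe" encoding for this sketch


def write_bytes_as_xml(raw: bytes, path: str) -> None:
    """Store raw 8-bit data as the text of a <data> element, tagged with its encoding."""
    root = ET.Element("data")
    root.text = raw.decode(SAFE_ENCODING)
    # tostring() emits an XML declaration naming the encoding and returns bytes,
    # so the file is opened in binary mode to avoid any further translation.
    with open(path, "wb") as fh:
        fh.write(ET.tostring(root, encoding=SAFE_ENCODING))


def read_bytes_from_xml(path: str) -> bytes:
    """Parse the file (the parser decodes to Unicode) and re-encode to raw bytes."""
    root = ET.parse(path).getroot()
    return root.text.encode(SAFE_ENCODING)


path = os.path.join(tempfile.mkdtemp(), "sample.xml")
original = b"caf\xe9 \xa9 2004"
write_bytes_as_xml(original, path)
print(read_bytes_from_xml(path) == original)
```

This is not fully 8-bit clean: XML 1.0 does not allow most control characters below 0x20, and parsers normalize line endings, so data containing those bytes would need extra escaping.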

Re: Re: 8-bit Clean XML Data I/O?
by samtregar (Abbot) on Feb 21, 2004 at 00:32 UTC
    If you don't know what encoding the character data is in, it isn't very useful.

    I've heard this before, and it never struck me as persuasive. The rest of this system is very useful and it doesn't need to know the character-set of the data. In fact, there have been far fewer character-set related bugs in this system than in a previous "100% Unicode" system which performed a similar function.

    You may be able to puzzle out the encoding by looking at the characters.

    Oh, I've been there before and basically found it to be a giant waste of time. Even when it works it's rarely 100% successful. Losing data, even "junk" data which doesn't work for any character set, is not an option in this application.

    -sam

      The characters are going to get displayed at some point. If the wrong encoding is used, they are going to be displayed as junk. The Hebrew is not going to look right when displayed as Russian.

      Now, most systems deal with this by context. Everyone uses the same encoding for input and output and it all works. Until someone uses a different locale. Or they cut-and-paste from an app that doesn't declare the encoding. Or they send the file/email/database to someone else.

      Also, XML is logically defined as using Unicode characters. Files either have the default encoding of UTF-16 or UTF-8, or they must declare the encoding. Many parsers will convert from the declared encoding to Unicode strings and only deal with Unicode.

      Your choices are: a) figure out what encoding is being used and mark the XML with that; b) generate invalid XML by not marking the encoding and using 8-bit bytes instead of UTF-8; c) find a safe encoding and transform the Unicode back into binary bytes; d) transcode to UTF-8 and use that everywhere. Options a and d are the best solutions and are standard.

        The characters are going to get displayed at some point. If the wrong encoding is used, they are going to be displayed as junk. The Hebrew is not going to look right when displayed as Russian.

        Sure, that's true. Of course, the same thing happens all the time with Unicode applications that guess the wrong character set and botch the conversion to and from UTF-8!

        Watching a Unicode app puke all over my data is what convinced me to make the next version 8-bit clean. I'll happily let the end-user worry about choosing the right character set and setting the right headers on their output. I'll even show them how to extend the app to verify that their data is in the right character set for what they're doing. But I'll be damned if I'm going to pretend I can know the character set of any given input in the general case.

        Like all trade-offs, this one will take time to prove itself. So far the comparison has been a good one, but we'll see!

        -sam