samtregar has asked for the wisdom of the Perl Monks concerning the following question:

I've been trying to answer this question ever since I started playing with XML. I've got some 8-bit character data. I don't know what character set it's in, and I can't find out. I want to put it in an XML document such that when I read that document later I get the same 8-bit characters in Perl.

At the moment I'm writing the data using XML::Writer, with code like:

    my $writer = XML::Writer->new(OUTPUT => $fh, DATA_MODE => 1, DATA_INDENT => 4);
    $writer->dataElement(foo => $bar);
    $writer->end();

Then, later I try to read it in again using XML::Simple:

    my $data = XMLin($xml, %args);

This blows up when $bar contains characters that aren't legal for UTF-8:

    not well-formed (invalid token) at line 25, column 102, byte 980 at /usr/local/krang/lib/i686-linux/XML/Parser/Expat.pm line 478

What is to be done?

UPDATE: Taking gmpassos's suggestion, I adopted a mechanism similar to XML::Smart. I created a sub-class of XML::Writer which will automatically Base64 encode character content which has illegal characters in it. This content is prefixed with a "!!!BASE64!!!" marker. I then created a sub-class of XML::Simple which will automatically decode these sections by looking for the marker.
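
A minimal sketch of the marker idea described in this update, not the actual sub-classes. It assumes the document is written as UTF-8, and the helper names (needs_base64, to_xml_text, from_xml_text) are hypothetical. Content is Base64 encoded when it is not well-formed UTF-8, when it contains characters XML 1.0 does not allow, or when it happens to start with the marker itself:

```perl
use strict;
use warnings;
use Encode ();
use MIME::Base64 qw(encode_base64 decode_base64);

my $MARKER = '!!!BASE64!!!';

# Hypothetical helper: true if $bytes cannot be written as-is into a UTF-8 XML document.
sub needs_base64 {
    my ($bytes) = @_;

    # A literal marker prefix must be encoded too, or decoding would be ambiguous.
    return 1 if index($bytes, $MARKER) == 0;

    # Strict UTF-8 decoding croaks on malformed input; eval turns that into undef.
    my $chars = eval { Encode::decode('UTF-8', $bytes, Encode::FB_CROAK | Encode::LEAVE_SRC) };
    return 1 unless defined $chars;

    # Reject anything outside the XML 1.0 Char production (e.g. most control characters).
    return $chars =~ /[^\x{9}\x{A}\x{D}\x{20}-\x{D7FF}\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}]/ ? 1 : 0;
}

# Text to hand to the writer: either the bytes unchanged or marker + Base64.
sub to_xml_text {
    my ($bytes) = @_;
    return needs_base64($bytes) ? $MARKER . encode_base64($bytes, '') : $bytes;
}

# Reverse step for text written by to_xml_text.
sub from_xml_text {
    my ($text) = @_;
    return $text unless index($text, $MARKER) == 0;
    return decode_base64(substr($text, length $MARKER));
}

for my $sample ("plain text", "caf\xE9", "\x00\x01binary") {
    my $stored = to_xml_text($sample);
    printf "%-30s %s\n", $stored, from_xml_text($stored) eq $sample ? 'ok' : 'MISMATCH';
}
```

One caveat with this sketch: text that was passed through unchanged comes back from an expat-based parser as decoded characters rather than bytes, so a real reader would probably need to re-encode it as UTF-8 (for example with utf8::encode) to recover the original bytes.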

It sure isn't pretty, but it sure does work. Maybe someday I'll come up with something more elegant, but until then I'm happy to mark this one FIXED in Bugzilla and move on. Thanks monks!

-sam

Replies are listed 'Best First'.
Re: 8-bit Clean XML Data I/O?
by mirod (Canon) on Feb 20, 2004 at 23:30 UTC

    Of course you can encode the data, in Base64, or in a smarter way if your data is mostly ASCII, for example. Beyond that, I don't really see how you can store data in XML if you don't know its encoding. You would think that you could find a nice encoding that covered characters 0-255, which would allow you to parse the data, and then later figure out what to do with it. The problem is that parsers tend to want to convert what they get into utf8. At least XML::Parser and XML::LibXML do this, so if you lie about the encoding of the data, then you will get it, converted to utf8 from the wrong encoding... :--(

    That said, XML::Twig has a mode in which it uses the original data instead of the utf8 one. You can get that data in XML::Parser too: use the original_string method on the XML::Parser::Expat object. But you have to make sure that no matter what the real encoding is, the data will be valid for the "fake" one you declare your document to be in. I don't know enough about encodings to have a suggestion there.

    But frankly, if I was dealing with sources in various encodings, I would try really hard to get them all in Unicode before trying to hack something like this.

      But frankly, if I was dealing with sources in various encodings, I would try really hard to get them all in Unicode before trying to hack something like this.

      You may be right, but I don't think it's much of an option for me. This XML system is an add-on to an existing web-app which is 8-bit clean by design. Basically, by the time I'm interested in doing XML I/O the source character set is long gone. Modifying the app to somehow intuit the character set on input is possible, but far from ideal.

      Thanks,
      -sam

Re: 8-bit Clean XML Data I/O?
by gmpassos (Priest) on Feb 21, 2004 at 02:15 UTC
    Try XML::Smart. It handles ASCII contents and binary contents automatically. So, if you put some binary data, let's say a JPEG, as the content of a node, it will be automatically converted to base64. It will also automatically load and decode base64 contents. Note that this approach is a recommendation for binary handling at XML.com.

    Enjoy! ;-P

    Graciliano M. P.
    "Creativity is the expression of the liberty".

      Very interesting. I think it might be too much work to convert my app to use XML::Smart, but I might be able to adopt this approach. It seems very sensible.

      -sam

Re: 8-bit Clean XML Data I/O?
by diotalevi (Canon) on Feb 20, 2004 at 23:05 UTC
    It seems that when you write your document you'll have to have settled on UTF-8. You could encode your binary data to be unicode safe and then unescape it afterward. There isn't any "binary-data" character set in XML so the straight write:binary/read:utf8 cycle won't work. You could just toss your XML parser and handle your input yourself.
      You could encode your binary data to be unicode safe and then unescape it afterward.

      How do I encode arbitrary 8-bit character data as UTF-8 so that I can get it back again when I read it?

      You could just toss your XML parser and handle your input yourself.

      After all the time I put into getting XML::Validator::Schema working, I can't imagine going that route.

      Thanks,
      -sam

        There's no magic encoding here - you have to either use a pre-existing encode/decode or write your own. You might just use base64.
Re: 8-bit Clean XML Data I/O?
by mr_mischief (Monsignor) on Feb 20, 2004 at 23:06 UTC
    As much as I should probably add the buzzword to my resume, I haven't learned all that much about XML just yet. I don't know off the top of my head if it's 8-bit clean or whatnot. According to the standards info I've found in the last two minutes, XML itself allows "Legal characters are tab, carriage return, line feed, and the legal characters of Unicode and ISO/IEC 10646". The Character Range is:
    Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF] /* any Unicode character, excluding the surrogate blocks, FFFE, and FFFF. */
    There are some ranges recommended by the W3C to be avoided:
    [#x7F-#x84], [#x86-#x9F], [#xFDD0-#xFDDF], [#x1FFFE-#x1FFFF], [#x2FFFE-#x2FFFF], [#x3FFFE-#x3FFFF], [#x4FFFE-#x4FFFF], [#x5FFFE-#x5FFFF], [#x6FFFE-#x6FFFF], [#x7FFFE-#x7FFFF], [#x8FFFE-#x8FFFF], [#x9FFFE-#x9FFFF], [#xAFFFE-#xAFFFF], [#xBFFFE-#xBFFFF], [#xCFFFE-#xCFFFF], [#xDFFFE-#xDFFFF], [#xEFFFE-#xEFFFF], [#xFFFFE-#xFFFFF], [#x10FFFE-#x10FFFF].
    So I guess other than that, you're looking at application-specific limitations.
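
    For what it's worth, the Char production quoted above translates fairly directly into a Perl character class. The sketch below is only an illustration (is_xml_clean is a hypothetical helper name); it lists which of the 256 byte values would be illegal if each byte were read as the code point of the same number, as happens with a Latin-1 style encoding:

```perl
use strict;
use warnings;

# Character class built from the XML 1.0 Char production.
my $XML_CHAR = qr/[\x{9}\x{A}\x{D}\x{20}-\x{D7FF}\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}]/;

# Hypothetical helper: true if every character of $str is a legal XML 1.0 character.
sub is_xml_clean {
    my ($str) = @_;
    return $str =~ /\A$XML_CHAR*\z/ ? 1 : 0;
}

# Byte values that are not legal XML characters when read as code points 0..255.
my @illegal = grep { !is_xml_clean(chr($_)) } 0 .. 255;

printf "%d illegal byte values: %s\n", scalar @illegal,
    join ' ', map { sprintf '%02X', $_ } @illegal;
```

    The illegal values are all control characters below 0x20 (everything except tab, line feed and carriage return), so even an encoding that maps every byte to a code point cannot carry those bytes as literal XML text.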

    Still, there are ways to work around this even if you're limited to 7 bits. A nice Base64 routine could help if nothing else, but since you're using XML you're already paying relatively heavy storage and complexity premiums in exchange for all the flexibility you're getting. It's often a good tradeoff, but a tradeoff still. So something that explodes your storage and processing like Base64 for your data, which could also lose the clarity of data storage that XML is trying to give you unless it's applied judiciously, might be out.

    My guess is you're getting characters in some funky non-ASCII, non-Unicode character set, such as one of the myriad extended ASCII sets, or possibly that you're getting actual binary data from somewhere. If your spec says it's all characters, then you may have to convert into a Unicode or UTF encoding. I'd recommend UTF-8, which does, in fact, support larger than 7-bit characters when properly encoded. It just requires that characters other than the traditional 7-bit ASCII characters be encoded with an escape value and additional bytes. I'm not sure of the specifics beyond that, but I do know that's the basic idea.
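
    To make the "additional bytes" part concrete, here is a small sketch using the core Encode module. It prints the UTF-8 bytes for three sample code points; utf8_hex is a hypothetical helper name, and the code points were picked only as examples of one-, two- and three-byte sequences:

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: the UTF-8 encoding of one code point as space-separated hex bytes.
sub utf8_hex {
    my ($code_point) = @_;
    my $bytes = encode('UTF-8', chr($code_point));
    return join ' ', map { sprintf '%02X', ord($_) } split //, $bytes;
}

# 7-bit ASCII stays a single byte; anything above 0x7F becomes a lead byte
# followed by one or more continuation bytes.
for my $code_point (0x41, 0xE9, 0x20AC) {
    printf "U+%04X -> %s\n", $code_point, utf8_hex($code_point);
}
# prints:
#   U+0041 -> 41
#   U+00E9 -> C3 A9
#   U+20AC -> E2 82 AC
```

    This is also why a lone byte such as 0xE9 makes the parser complain: on its own it is a lead byte with no valid continuation bytes after it.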

    Of course, how well XML::Writer and XML::Simple handle such things I don't know. I'm just grasping at straws that you may not have grasped at yourself yet. Hopefully I've touched on something you just haven't noticed yet.



    Christopher E. Stith
Re: 8-bit Clean XML Data I/O?
by iburrell (Chaplain) on Feb 21, 2004 at 00:22 UTC
    If you don't know what encoding the character data is in, it isn't very useful. You might as well strip it out completely because without figuring out the encoding, it is just junk. You may be able to puzzle out the encoding by looking at the characters. For European languages, it is probably ISO-8859-1, possibly with Windows CP1252 characters in it.

    Many of the 8-bit encodings can be translated to Unicode and back again without losing any information. You will need to choose an encoding that works well for this; Latin1 or CP1252 are reasonable choices. There are two ways to handle this in XML. The best way is probably to write the XML in your encoding and tag it.

    XML::Writer looks like it doesn't do any translation. You will need to declare the chosen encoding, make sure the file is in binary mode, and write the strings. The parsers that XML::Simple uses support encodings declared in the file. But they need to know about the encoding because they translate everything into Unicode character data, which is stored in Perl as UTF-8 and will need to be translated back to your "safe" encoding after reading.
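
    A minimal sketch of that round trip using only the core Encode module, assuming ISO-8859-1 is picked as the "safe" encoding. The helper names are hypothetical, and no XML parser is involved here; the UTF-8 step just stands in for what a parser would hand back:

```perl
use strict;
use warnings;
use Encode qw(decode encode);

my $SAFE = 'ISO-8859-1';    # assumption: the "safe" encoding chosen for the round trip

# ISO-8859-1 maps each byte 0x00-0xFF to the code point of the same number,
# so decoding never fails and never loses information.
sub bytes_to_chars {
    my ($bytes) = @_;
    return decode($SAFE, $bytes);
}

# Reverse step: turn the characters back into the original bytes.
sub chars_to_bytes {
    my ($chars) = @_;
    return encode($SAFE, $chars);
}

# Every byte value from 0x20 up; lower control characters are left out on purpose.
my $original = join '', map { chr($_) } 0x20 .. 0xFF;

my $chars = bytes_to_chars($original);
my $utf8  = encode('UTF-8', $chars);                   # what a UTF-8 XML file would hold
my $back  = chars_to_bytes(decode('UTF-8', $utf8));    # what the reader recovers

print $back eq $original ? "round trip ok\n" : "round trip FAILED\n";
```

    Bytes below 0x20 other than tab, line feed and carriage return still map to characters that XML 1.0 does not allow, so this only covers data that is free of those control bytes.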

      If you don't know what encoding the character data is in, it isn't very useful.

      I've heard this before, and it never struck me as persuasive. The rest of this system is very useful and it doesn't need to know the character-set of the data. In fact, there have been far fewer character-set related bugs in this system than in a previous "100% Unicode" system which performed a similar function.

      You may be able to puzzle out the encoding by looking at the characters.

      Oh, I've been there before and basically found it to be a giant waste of time. Even when it works it's rarely 100% successful. Losing data, even "junk" data which doesn't work for any character set, is not an option in this application.

      -sam

        The characters are going to get displayed at some point. If the wrong encoding is used, they are going to be displayed as junk. Hebrew is not going to look right when displayed as Russian.

        Now, most systems deal with this by context. Everyone uses the same encoding for input and output and it all works. Until someone uses a different locale. Or they cut-and-paste from an app that doesn't declare the encoding. Or they send the file/email/database to someone else.

        Also, XML is logically defined as using Unicode characters. Files either have the default encoding of UTF-16 or UTF-8, or they must declare the encoding. Many parsers will convert from the declared encoding to Unicode strings and only deal with Unicode.

        Your choices are to: a) figure out what encoding is being used and mark the XML with that; b) generate invalid XML by not marking the encoding and using 8-bit bytes instead of UTF-8; c) find a safe encoding and transform the Unicode back into binary bytes after parsing; d) transcode to UTF-8 and use that everywhere. Options a and d are the best solutions and are standard.