in reply to 8-bit Clean XML Data I/O?

As much as I should probably add the buzzword to my resume, I haven't learned all that much about XML just yet, so I don't know off the top of my head whether it's 8-bit clean. According to the standards info I've found in the last two minutes, though, XML itself says: "Legal characters are tab, carriage return, line feed, and the legal characters of Unicode and ISO/IEC 10646". The character range is:
Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF] /* any Unicode character, excluding the surrogate blocks, FFFE, and FFFF. */
There are some ranges recommended by the W3C to be avoided:
[#x7F-#x84], [#x86-#x9F], [#xFDD0-#xFDDF], [#x1FFFE-#x1FFFF], [#x2FFFE-#x2FFFF], [#x3FFFE-#x3FFFF], [#x4FFFE-#x4FFFF], [#x5FFFE-#x5FFFF], [#x6FFFE-#x6FFFF], [#x7FFFE-#x7FFFF], [#x8FFFE-#x8FFFF], [#x9FFFE-#x9FFFF], [#xAFFFE-#xAFFFF], [#xBFFFE-#xBFFFF], [#xCFFFE-#xCFFFF], [#xDFFFE-#xDFFFF], [#xEFFFE-#xEFFFF], [#xFFFFE-#xFFFFF], [#x10FFFE-#x10FFFF].
So I guess other than that, you're looking at application-specific limitations.
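
The ranges quoted above can be checked mechanically. Here's a rough sketch, in Python rather than Perl just to keep it short. The function names are made up for illustration, and the "discouraged" check follows the list exactly as quoted in this post, so treat it as a starting point rather than a validator:

```python
import re

# Anything NOT matched by the Char production quoted above.
_ILLEGAL_XML = re.compile(
    r"[^\u0009\u000A\u000D\u0020-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def illegal_xml_chars(text):
    """Return the characters in text that the Char production rejects."""
    return _ILLEGAL_XML.findall(text)


def strip_illegal_xml_chars(text, replacement=""):
    """Replace every character the Char production rejects."""
    return _ILLEGAL_XML.sub(replacement, text)


def is_discouraged(ch):
    """True if ch is legal but falls in one of the ranges to be avoided."""
    cp = ord(ch)
    return (
        0x7F <= cp <= 0x84
        or 0x86 <= cp <= 0x9F
        or 0xFDD0 <= cp <= 0xFDDF
        # The last two code points of every plane from 1 through 16.
        or (cp >= 0x1FFFE and (cp & 0xFFFF) >= 0xFFFE)
    )
```

Running something like illegal_xml_chars() over the data before it goes anywhere near the XML writer should at least tell you whether you're dealing with control characters or something else entirely.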

Still, there are ways to work around this even if you're limited to 7 bits. A nice Base64 routine could help if nothing else. But since you're using XML, you're already paying relatively heavy storage and complexity premiums in exchange for all the flexibility you're getting. It's often a good tradeoff, but a tradeoff still. Base64 grows the encoded data by about a third, adds processing on both ends, and unless it's applied judiciously it costs you the clarity of data storage that XML is trying to give you, so it might be out.
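
If you do go the Base64 route, the idea is roughly the following. This is only a sketch in Python using its standard library. The element name "blob" and the "encoding" attribute are my own invention, not anything XML defines, so whatever reads the file has to agree on them:

```python
import base64
import xml.etree.ElementTree as ET


def wrap_binary(payload: bytes) -> str:
    """Store arbitrary bytes as Base64 text inside a hypothetical <blob> element."""
    elem = ET.Element("blob", {"encoding": "base64"})
    elem.text = base64.b64encode(payload).decode("ascii")
    return ET.tostring(elem, encoding="unicode")


def unwrap_binary(xml_text: str) -> bytes:
    """Recover the original bytes from the element produced by wrap_binary."""
    elem = ET.fromstring(xml_text)
    # An empty payload serializes to an empty element, whose text is None.
    return base64.b64decode(elem.text or "")
```

Every 3 input bytes become 4 output characters, which is where the size penalty comes from. Applying this to only the fields that really hold binary data keeps the rest of the document readable.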

My guess is you're getting characters in some funky non-ASCII, non-Unicode character set, such as one of the myriad extended ASCII sets, or possibly that you're getting actual binary data from somewhere. If your spec says it's all characters, then you may have to convert the data to a Unicode encoding. I'd recommend UTF-8, which does, in fact, support characters beyond 7-bit ASCII when properly encoded. The traditional 7-bit ASCII characters stay as single bytes; everything else is encoded as a sequence of two to four bytes, where a lead byte announces the length of the sequence and is followed by continuation bytes. I'm not sure of the specifics beyond that, but I do know that's the basic idea.
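
To make the conversion idea concrete, here's a small sketch (again in Python, with made-up function names). It assumes you already know which legacy encoding the data is in. The cp1252 default below is purely a guess for illustration, and picking the wrong source encoding will quietly give you the wrong characters:

```python
def to_utf8(raw: bytes, source_encoding: str = "cp1252") -> bytes:
    """Decode bytes from a known legacy encoding and re-encode them as UTF-8."""
    # Raises UnicodeDecodeError if a byte is undefined in the source encoding.
    return raw.decode(source_encoding).encode("utf-8")


def show_bits(data: bytes) -> str:
    """Render bytes as binary so the UTF-8 lead and continuation bytes are visible."""
    return " ".join(f"{b:08b}" for b in data)
```

For example, to_utf8(b"caf\xe9", "latin-1") turns the single byte 0xE9 into the two bytes 0xC3 0xA9, and show_bits() on those two bytes gives "11000011 10101001": the first byte starts with 110 (a two-byte sequence) and the second starts with 10 (a continuation byte).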

Of course, I don't know how well XML::Writer and XML::Simple handle such things. I'm just grasping at straws that you may not have grasped at yourself yet. Hopefully I've touched on something you just haven't noticed yet.



Christopher E. Stith