in reply to 8-bit Clean XML Data I/O?
The XML 1.0 spec defines the legal characters as:

Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF] /* any Unicode character, excluding the surrogate blocks, FFFE, and FFFF. */

There are also some ranges recommended by the W3C to be avoided:

[#x7F-#x84], [#x86-#x9F], [#xFDD0-#xFDDF], [#x1FFFE-#x1FFFF], [#x2FFFE-#x2FFFF], [#x3FFFE-#x3FFFF], [#x4FFFE-#x4FFFF], [#x5FFFE-#x5FFFF], [#x6FFFE-#x6FFFF], [#x7FFFE-#x7FFFF], [#x8FFFE-#x8FFFF], [#x9FFFE-#x9FFFF], [#xAFFFE-#xAFFFF], [#xBFFFE-#xBFFFF], [#xCFFFE-#xCFFFF], [#xDFFFE-#xDFFFF], [#xEFFFE-#xEFFFF], [#xFFFFE-#xFFFFF], [#x10FFFE-#x10FFFF].

So I guess other than that, you're looking at application-specific limitations.
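If you want to find out which characters in your data are tripping things up, the ranges quoted above can be turned into a quick check. This is only a rough sketch, written in Python purely for illustration (the same table-driven test is easy to port); the function names `is_legal_xml_char`, `is_discouraged_xml_char`, and `find_illegal` are made up for this example and are not part of any XML module.

```python
# Legal ranges, copied from the Char production quoted above.
LEGAL_RANGES = [
    (0x9, 0x9), (0xA, 0xA), (0xD, 0xD),
    (0x20, 0xD7FF), (0xE000, 0xFFFD), (0x10000, 0x10FFFF),
]

# Discouraged ranges, as listed above: three BMP ranges plus the last
# two code points of every plane from 1 through 16 (0x10).
DISCOURAGED_RANGES = [(0x7F, 0x84), (0x86, 0x9F), (0xFDD0, 0xFDDF)]
DISCOURAGED_RANGES += [
    ((plane << 16) | 0xFFFE, (plane << 16) | 0xFFFF)
    for plane in range(1, 0x11)
]


def _in_ranges(code_point, ranges):
    """Return True if code_point falls inside any (low, high) pair."""
    return any(low <= code_point <= high for low, high in ranges)


def is_legal_xml_char(ch):
    """True if the single character ch matches the Char production."""
    return _in_ranges(ord(ch), LEGAL_RANGES)


def is_discouraged_xml_char(ch):
    """True if ch is in one of the ranges the W3C recommends avoiding."""
    return _in_ranges(ord(ch), DISCOURAGED_RANGES)


def find_illegal(text):
    """List (index, hex code point) for every character not allowed in XML."""
    return [
        (i, hex(ord(ch)))
        for i, ch in enumerate(text)
        if not is_legal_xml_char(ch)
    ]
```

Running `find_illegal` over a string that has already been decoded to characters shows exactly where the offending code points sit, which should help tell stray control characters apart from a genuine encoding mismatch.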
Still, there are ways to work around this even if you're limited to 7 bits. A Base64 routine could help if nothing else. But since you're using XML, you're already paying relatively heavy storage and complexity premiums in exchange for the flexibility you get. It's often a good tradeoff, but it is still a tradeoff. Base64 inflates your data by about a third, adds processing, and, unless it's applied judiciously, costs you the clarity of data storage that XML is trying to give you, so it might be out.
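To put a number on the Base64 cost, here is a small sketch (again Python, for illustration only) that encodes a buffer containing every possible byte value and compares sizes. The `payload` contents are an arbitrary assumption chosen to stand in for "8-bit data".

```python
import base64

# 768 bytes covering every byte value 0x00-0xFF, repeated three times.
payload = bytes(range(256)) * 3

# Base64 maps each group of 3 input bytes to 4 output characters.
encoded = base64.b64encode(payload)

# Ratio of encoded size to raw size (4/3 when the length is a multiple of 3).
overhead = len(encoded) / len(payload)

# Every byte of the encoded form fits in 7 bits, so it is safe as XML text.
seven_bit_clean = all(byte < 0x80 for byte in encoded)

# Decoding gives back the original bytes exactly.
round_trip = base64.b64decode(encoded)

print(len(payload), len(encoded), overhead, seven_bit_clean)
```

The output is 7-bit clean and round-trips exactly, but anyone reading the XML sees an opaque blob instead of the data, which is the clarity cost mentioned above.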
My guess is you're getting characters in some funky non-ASCII, non-Unicode character set, such as one of the myriad extended ASCII sets (the ISO-8859 family, Windows code pages, and so on), or possibly that you're getting actual binary data from somewhere. If your spec says it's all characters, then you may have to convert the data to a Unicode encoding. I'd recommend UTF-8, which does, in fact, support larger than 7-bit characters when properly encoded. Traditional 7-bit ASCII characters are stored unchanged as single bytes; every other character is stored as a multi-byte sequence, with a lead byte that signals the sequence length followed by one or more continuation bytes.
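As a concrete illustration of that conversion, the sketch below (Python, for illustration only) decodes bytes from an assumed legacy encoding and re-encodes them as UTF-8. The helper name `to_utf8` and the choice of Latin-1 and Windows-1252 as source encodings are assumptions for the example; you would need to find out what your data actually uses.

```python
def to_utf8(raw, source_encoding="latin-1"):
    """Decode raw bytes from source_encoding and re-encode them as UTF-8."""
    return raw.decode(source_encoding).encode("utf-8")


# "café" in Latin-1: the e-acute is the single byte 0xE9.
latin1_bytes = b"caf\xe9"
utf8_bytes = to_utf8(latin1_bytes)

# In UTF-8 the same character (U+00E9) becomes the two bytes 0xC3 0xA9;
# the three ASCII letters before it are unchanged.
lead, continuation = utf8_bytes[3], utf8_bytes[4]
is_two_byte_lead = (lead & 0xE0) == 0xC0          # bit pattern 110xxxxx
is_continuation = (continuation & 0xC0) == 0x80   # bit pattern 10xxxxxx

# The same byte value can mean different characters in different code pages:
# 0x80 is the euro sign in Windows-1252, which takes three bytes in UTF-8.
euro_utf8 = to_utf8(b"\x80", source_encoding="cp1252")

print(utf8_bytes, euro_utf8)
```

Picking the wrong source encoding does not necessarily raise an error (Latin-1 accepts every byte value), it just produces the wrong characters, so it is worth confirming where the data comes from before converting.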
Of course, how well XML::Writer and XML::Simple handle such things I don't know. I'm just grasping at straws that you may not have grasped at yourself yet. Hopefully I've touched on something you hadn't noticed.