As much as I should probably add the buzzword to my resume, I haven't learned all that much about XML just yet. I don't know off the top of my head if it's 8-bit clean or whatnot. According to the standards info I've found in the last two minutes, the XML spec itself says: "Legal characters are tab, carriage return, line feed, and the legal characters of Unicode and ISO/IEC 10646". The character range is:
Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF] /* any Unicode character, excluding the surrogate blocks, FFFE, and FFFF. */
There are some ranges recommended by the W3C to be avoided:
[#x7F-#x84], [#x86-#x9F], [#xFDD0-#xFDEF], [#x1FFFE-#x1FFFF], [#x2FFFE-#x2FFFF], [#x3FFFE-#x3FFFF], [#x4FFFE-#x4FFFF], [#x5FFFE-#x5FFFF], [#x6FFFE-#x6FFFF], [#x7FFFE-#x7FFFF], [#x8FFFE-#x8FFFF], [#x9FFFE-#x9FFFF], [#xAFFFE-#xAFFFF], [#xBFFFE-#xBFFFF], [#xCFFFE-#xCFFFF], [#xDFFFE-#xDFFFF], [#xEFFFE-#xEFFFF], [#xFFFFE-#xFFFFF], [#x10FFFE-#x10FFFF].
So I guess other than that, you're looking at application-specific limitations.
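
If it helps to see the Char production above as code, here is a minimal sketch of a legality check. It's in Python rather than Perl purely for illustration, and the function names are hypothetical, not part of XML::Writer, XML::Simple, or any other library. It only covers the hard "legal" ranges, not the ranges the W3C merely discourages.

```python
def is_xml_char(cp):
    """Return True if code point cp matches the XML 1.0 Char production."""
    return (
        cp in (0x9, 0xA, 0xD)          # tab, line feed, carriage return
        or 0x20 <= cp <= 0xD7FF        # up to just before the surrogates
        or 0xE000 <= cp <= 0xFFFD      # skips FFFE and FFFF
        or 0x10000 <= cp <= 0x10FFFF   # supplementary planes
    )


def illegal_xml_chars(text):
    """List the code points in text that XML 1.0 does not allow."""
    return [ord(ch) for ch in text if not is_xml_char(ord(ch))]
```

Running something like illegal_xml_chars over the incoming data before handing it to the XML writer would at least tell you whether stray control characters are the problem.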

Still, there are ways to work around this even if you're limited to 7 bits. A nice Base64 routine could help if nothing else, but since you're using XML you're already paying relatively heavy storage and complexity premiums in exchange for all the flexibility you're getting. It's often a good tradeoff, but still a tradeoff. Base64 inflates your data by about a third and adds processing, and unless it's applied judiciously it also costs you the clarity of data storage that XML is trying to give you, so it might be out.
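
To put a rough number on that Base64 cost, here is a small sketch (again Python, purely for illustration; the payload and the helper name are made up for the example). Every 3 input bytes become 4 output characters, all of them 7-bit safe.

```python
import base64


def base64_ratio(raw):
    """Return encoded size divided by raw size for a non-empty byte string."""
    return len(base64.b64encode(raw)) / len(raw)


# Hypothetical payload: every possible byte value, repeated three times.
payload = bytes(range(256)) * 3          # 768 bytes of 8-bit data
encoded = base64.b64encode(payload)      # 7-bit-safe ASCII text
restored = base64.b64decode(encoded)     # round-trips back to the original
```

For this payload the encoded form is 1024 bytes against 768 raw, which is the one-third growth mentioned above.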

My guess is you're getting characters in some funky non-ASCII, non-Unicode character set, such as one of the myriad extended ASCII sets, or possibly that you're getting actual binary data from somewhere. If your spec says it's all characters, then you may have to convert the data into a Unicode encoding. I'd recommend UTF-8, which does, in fact, support larger than 7-bit characters when properly encoded. It just requires that characters outside the traditional 7-bit ASCII range be encoded as multi-byte sequences: a lead byte with the high bit set, followed by one or more continuation bytes. I'm not sure of the specifics beyond that, but I do know that's the basic idea.
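
As a sketch of that conversion, assuming the incoming bytes are ISO-8859-1 (that's only a guess at your source encoding, and the function name is hypothetical), the idea is to decode with the real source character set and then re-encode as UTF-8. It's in Python here for illustration only.

```python
def to_utf8(raw, source_encoding="iso-8859-1"):
    """Re-encode raw bytes from source_encoding (an assumption) as UTF-8."""
    text = raw.decode(source_encoding)   # bytes -> Unicode characters
    return text.encode("utf-8")          # Unicode characters -> UTF-8 bytes


latin1_bytes = b"caf\xe9"                # "cafe" with an e-acute, one byte per character
utf8_bytes = to_utf8(latin1_bytes)       # the e-acute becomes two bytes: C3 A9
```

Plain 7-bit ASCII passes through unchanged, so only the high-bit characters grow.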

Of course, how well XML::Writer and XML::Simple handle such things I don't know. I'm just grasping at straws that you may not have grasped at yourself yet. Hopefully I've touched on something you just haven't noticed yet.



Christopher E. Stith

In reply to Re: 8-bit Clean XML Data I/O? by mr_mischief
in thread 8-bit Clean XML Data I/O? by samtregar
