Of course you can encode the data, in Base64 or, if it is mostly ASCII for example, in something smarter. Beyond that, I don't really see how you can store data in XML if you don't know its encoding. You would think that you could find a nice encoding that covers all 256 byte values (0-255), which would let you parse the data and figure out later what to do with it. The problem is that parsers tend to want to convert what they get to UTF-8. At least XML::Parser and XML::LibXML do this, so if you lie about the encoding of the data, you will get it back converted to UTF-8 from the wrong encoding... :--(
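To make the two options above concrete, here is a minimal sketch using only core modules (MIME::Base64 and Encode). The `<blob>` element, its `encoding` attribute and the sample bytes are made up for illustration. The last few lines only imitate, by hand, the kind of conversion a parser would apply if the document were declared as ISO-8859-1; they are not what any particular parser does internally.

```perl
use strict;
use warnings;
use MIME::Base64 qw(encode_base64 decode_base64);
use Encode qw(decode encode);

# Hypothetical sample: every possible byte value, encoding unknown.
my $raw = pack 'C*', 0 .. 255;

# Base64 turns any byte string into plain ASCII, so the XML encoding
# declaration no longer matters for the payload.
my $b64 = encode_base64($raw, '');    # '' = no line breaks in the output
my $xml = qq{<?xml version="1.0"?>\n<blob encoding="base64">$b64</blob>\n};

# Round trip: extract the payload (a regex is enough for this toy
# document) and decode it back to the original bytes.
my ($payload) = $xml =~ m{<blob[^>]*>([^<]*)</blob>};
my $back = decode_base64($payload);
print $back eq $raw ? "round trip ok\n" : "round trip FAILED\n";

# The "lie about the encoding" case, imitated by hand: a byte declared
# as ISO-8859-1 is read as one Latin-1 character and comes out as UTF-8.
my $bytes   = "\xE9";
my $as_utf8 = encode('UTF-8', decode('ISO-8859-1', $bytes));
printf "%d byte in, %d bytes out\n", length $bytes, length $as_utf8;
```

If the byte was not really Latin-1 to begin with, the two bytes that come out are simply wrong, which is the problem described above. The Base64 route avoids it at the cost of roughly a third more space.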

That said, XML::Twig has a mode (the keep_encoding option) in which it uses the original data instead of the UTF-8 version. You can get that data in XML::Parser too: call the original_string method on the XML::Parser::Expat object from within a handler. But you have to make sure that, no matter what the real encoding is, the data will be valid for the "fake" one you declare your document to be in. I don't know enough about encodings to have a suggestion there.
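A rough sketch of the XML::Parser route, assuming the data is declared as ISO-8859-1 (the "fake" encoding here) and that all you need is the character data. The sample document is made up. Note that original_string returns the source text verbatim, so entity references such as &amp;amp; come back unexpanded.

```perl
use strict;
use warnings;
use XML::Parser;

# Hypothetical document: declared as ISO-8859-1, one non-ASCII byte.
my $doc = qq{<?xml version="1.0" encoding="ISO-8859-1"?>\n<d>caf\xE9</d>};

my ($converted, $original) = ('', '');

my $parser = XML::Parser->new(
    Handlers => {
        Char => sub {
            my ($expat, $text) = @_;
            # What the parser hands you: already converted to UTF-8.
            $converted .= $text;
            # The same chunk as it appears in the source document.
            $original .= $expat->original_string;
        },
    },
);
$parser->parse($doc);

# Compare the raw bytes of both versions.
utf8::encode($converted);    # turn the character string into UTF-8 bytes
printf "converted: %s\n", unpack 'H*', $converted;
printf "original:  %s\n", unpack 'H*', $original;
```

The Char handler can be called several times for one text node, hence the concatenation. This only gets the bytes through untouched; it does nothing about data that is not well-formed under the declared encoding.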

But frankly, if I were dealing with sources in various encodings, I would try really hard to get them all into Unicode before trying to hack something like this.


In reply to Re: 8-bit Clean XML Data I/O? by mirod
in thread 8-bit Clean XML Data I/O? by samtregar