in reply to 8-bit Clean XML Data I/O?

Of course you can encode the data, in Base64 or in some smarter way if, for example, your data is mostly ASCII. Beyond that, I don't really see how you can store data in XML if you don't know its encoding. You would think that you could find a nice encoding that covered byte values 0-255, which would allow you to parse the data and figure out later what to do with it. The problem is that parsers tend to want to convert what they get into UTF-8. At least XML::Parser and XML::LibXML do this, so if you lie about the encoding of the data, you will get it back converted to UTF-8 from the wrong encoding... :--(
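
For what it's worth, here is a minimal sketch of the Base64 route. It is in Python rather than Perl purely for illustration (Perl's MIME::Base64 gives you the same encode/decode pair), and the `blob` element, its `encoding` attribute and the two helper functions are names I made up, not part of any existing format:

```python
import base64
import xml.etree.ElementTree as ET


def wrap_bytes(payload: bytes) -> str:
    """Return a small XML document carrying payload as Base64 text."""
    # "blob" and its "encoding" attribute are hypothetical names.
    root = ET.Element("blob", {"encoding": "base64"})
    # Base64 output is pure ASCII, so it is safe under any declared encoding.
    root.text = base64.b64encode(payload).decode("ascii")
    return ET.tostring(root, encoding="unicode")


def unwrap_bytes(xml_text: str) -> bytes:
    """Parse the document and recover the original bytes."""
    root = ET.fromstring(xml_text)
    return base64.b64decode(root.text or "")


# Every possible byte value, in no particular character encoding.
raw = bytes(range(256))
doc = wrap_bytes(raw)
restored = unwrap_bytes(doc)
```

The parser only ever sees ASCII here, so it has nothing to convert, and the bytes come back exactly as they went in. The price is that the payload is opaque to anything that reads the XML.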

That said, XML::Twig has a mode in which it uses the original data instead of the UTF-8 version. You can get that data in XML::Parser too: use the original_string method on the XML::Parser::Expat object. But you have to make sure that, no matter what the real encoding is, the data will be valid in the "fake" one you declare your document to be in. I don't know enough about encodings to have a suggestion there.
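
To make the "valid for the fake encoding" condition concrete, here is a rough sketch, again in Python and only as an illustration. Plain `bytes.decode` stands in for what a parser does when it converts input, which is an assumption on my part, and the sample string and the `decodes_as` helper are made up for the example:

```python
# Sample data whose real encoding is Latin-1 (the bytes are b"caf\xe9").
data = "café".encode("iso-8859-1")

# Lying about the encoding: decoding as ISO-8859-5 gives the wrong character...
misread = data.decode("iso-8859-5")
# ...but re-encoding with the same fake encoding returns the original bytes.
recovered = misread.encode("iso-8859-5")


def decodes_as(payload: bytes, encoding: str) -> bool:
    """Report whether every byte sequence in payload is valid in encoding."""
    try:
        payload.decode(encoding)
        return True
    except UnicodeDecodeError:
        return False


# A lone 0xE9 byte is not valid UTF-8, so UTF-8 cannot serve as the fake encoding.
utf8_ok = decodes_as(data, "utf-8")
# Latin-1 assigns a character to every byte value, so decoding never fails.
latin1_ok = decodes_as(bytes(range(256)), "iso-8859-1")
```

This only checks validity at the byte level. XML 1.0 also forbids most control characters below 0x20, so an encoding that accepts every byte is still not a complete answer on its own.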

But frankly, if I was dealing with sources in various encodings, I would try really hard to get them all in Unicode before trying to hack something like this.

Re: Re: 8-bit Clean XML Data I/O?
by samtregar (Abbot) on Feb 21, 2004 at 00:02 UTC
    But frankly, if I was dealing with sources in various encodings, I would try really hard to get them all in Unicode before trying to hack something like this.

    You may be right, but I don't think it's much of an option for me. This XML system is an add-on to an existing web-app which is 8-bit clean by design. Basically, by the time I'm interested in doing XML I/O the source character set is long gone. Modifying the app to somehow intuit the character set on input is possible, but far from ideal.

    Thanks,
    -sam