The native / default encoding for XML is UTF-8.
If you look at my code, you'll see that it attempts to determine the charset of the HTML code, and that when it exports the "original" code in XML, the "encoding" attribute is set to that character set in the <original> element.
I've ended up converting that chunk of HTML to UTF-8 and exporting it that way as well. Any attempts to do this with arbitrary data in non-UTF-8 charsets have failed.
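For anyone who wants to see the shape of that conversion, here is a minimal sketch in Python (not the code discussed above). It assumes the charset has already been detected, and the helper name `html_to_original_element` is made up for illustration. The bytes are decoded with the detected charset, stored as text in an `<original>` element whose "encoding" attribute records the source charset, and the document is then serialized as UTF-8.

```python
import xml.etree.ElementTree as ET

def html_to_original_element(raw: bytes, charset: str) -> ET.Element:
    """Hypothetical helper: wrap HTML bytes in an <original> element."""
    # Decode using the detected charset; raises UnicodeDecodeError if the
    # bytes are not valid in that charset.
    text = raw.decode(charset)
    elem = ET.Element("original")
    # Record where the text came from. The serialized XML itself is UTF-8.
    elem.set("encoding", charset)
    # ElementTree escapes <, > and & in the text when serializing.
    elem.text = text
    return elem

# Sample input: a Latin-1 encoded fragment containing a non-ASCII character.
raw = "<p>caf\u00e9</p>".encode("iso-8859-1")
elem = html_to_original_element(raw, "iso-8859-1")

# Serialize the element as UTF-8 bytes, XML declaration included.
xml_bytes = ET.tostring(elem, encoding="utf-8")

# Reading it back gives the original markup as a Unicode string.
round_tripped = ET.fromstring(xml_bytes).text
```

Note that this only works for data that really is text: XML 1.0 does not allow most control characters, so arbitrary binary data cannot be embedded this way.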
--telcontar
If anyone else reads this and has run into a similar problem, I suppose another way of getting around this would be to Base64-encode the data. That'd solve all charset and encoding problems right there. Unfortunately, it also adds roughly 33% space overhead (don't care) and decoding overhead when the file's loaded (more of a problem).
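A rough sketch of the Base64 idea, again in Python and under assumptions of my own: the helper names `wrap_base64` and `unwrap_base64` are hypothetical, and marking the element with `encoding="base64"` is just one possible convention, not something the code above does.

```python
import base64
import xml.etree.ElementTree as ET

def wrap_base64(raw: bytes) -> ET.Element:
    """Hypothetical helper: store arbitrary bytes as Base64 text."""
    elem = ET.Element("original")
    # Assumed convention: flag the payload as Base64 so a reader knows
    # it has to be decoded before use.
    elem.set("encoding", "base64")
    # Base64 output is plain ASCII, so it is safe in any XML encoding.
    elem.text = base64.b64encode(raw).decode("ascii")
    return elem

def unwrap_base64(elem: ET.Element) -> bytes:
    """Hypothetical helper: recover the original bytes."""
    return base64.b64decode(elem.text)

# Every possible byte value, including ones XML cannot carry as text.
raw = bytes(range(256))
elem = wrap_base64(raw)

# Round-trip through serialized XML and decode again.
restored = unwrap_base64(ET.fromstring(ET.tostring(elem)))

# Size cost: every 3 input bytes become 4 output characters.
encoded_length = len(elem.text)
```

The decode step on load is the cost mentioned above; the space cost shows up in `encoded_length`, which is about a third larger than `len(raw)`.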
-- telcontar