The native / default encoding for XML is UTF-8.
If you look at my code, you'll see that it attempts to determine the charset of the HTML code, and that when it exports the "original" code in XML, the "encoding" attribute in the <?xml ?> declaration is set to that character set.
I've ended up converting that chunk of HTML to UTF-8 and exporting it that way as well. My attempts to keep arbitrary data in its original non-UTF-8 charset have all failed.
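
For illustration, here is a minimal Python sketch of that conversion, assuming the charset has already been detected. The function name export_original_html and the element names "page" and "original" are made up for this example and are not taken from my code; treat it as one possible way to do it, not the actual implementation.

```python
import xml.etree.ElementTree as ET


def export_original_html(raw_html: bytes, charset: str) -> bytes:
    """Wrap raw HTML bytes in a UTF-8 encoded XML document.

    `charset` is whatever the detection step guessed (assumed to be a
    codec name Python knows). Bytes that do not fit that charset are
    replaced rather than raising, since the guess may be wrong.
    """
    # Step 1: turn the bytes into text using the detected charset.
    text = raw_html.decode(charset, errors="replace")

    # Step 2: build the XML tree; ElementTree escapes <, > and & in text.
    root = ET.Element("page")          # hypothetical element name
    original = ET.SubElement(root, "original")  # hypothetical element name
    original.text = text

    # Step 3: serialize as UTF-8 and include the XML declaration,
    # so the declared encoding always matches the actual bytes.
    return ET.tostring(root, encoding="utf-8", xml_declaration=True)


if __name__ == "__main__":
    latin1_bytes = "<p>caf\u00e9</p>".encode("iso-8859-1")
    print(export_original_html(latin1_bytes, "iso-8859-1").decode("utf-8"))
```

The point of the design is that the charset guess is only used once, at decode time; everything written out is UTF-8, so the declaration never has to name an arbitrary charset. Note that this sketch does not strip control characters that XML 1.0 forbids, so real-world input may need extra cleaning.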