in reply to Exporting HTML in an XML document

This sounds like it's an XML encoding issue. I've been using this line at the top of my XML: <?xml version="1.0" encoding="ISO-8859-1"?>. I imagine that this could be set in the constructor, e.g. "encoding" => "ISO-8859-1".
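
For anyone wanting to see the effect of that declaration in isolation, here is a minimal sketch. It is in Python rather than Perl, uses only the standard library, and is not the constructor option mentioned above; the function name `serialize_latin1` and the `<original>` wrapper are just for illustration. The point is that the declared encoding and the bytes actually written have to agree.

```python
import xml.etree.ElementTree as ET

def serialize_latin1(text: str) -> bytes:
    """Serialize a one-element document as ISO-8859-1 (illustrative helper)."""
    root = ET.Element("original")
    root.text = text
    # Passing a non-UTF-8 encoding makes ElementTree emit a matching
    # XML declaration and encode the character data accordingly.
    return ET.tostring(root, encoding="ISO-8859-1")

if __name__ == "__main__":
    print(serialize_latin1("caf\u00e9"))
```

A parser reading this output back relies on the declaration to interpret the byte 0xE9 as "é"; without it, the same byte would be rejected as invalid UTF-8.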

Re^2: Exporting HTML in an XML document
by telcontar (Beadle) on Nov 19, 2007 at 10:36 UTC
    The native / default encoding for XML is UTF-8.

    If you look at my code, you'll see that it attempts to determine the charset of the HTML code, and that when it exports the "original" code in XML, the "encoding" attribute is set to that character set in the <original> element.

    I've ended up converting that chunk of HTML to UTF-8 and exporting it that way as well. All of my attempts to export arbitrary data in non-UTF-8 charsets have failed.
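
    Roughly, the conversion looks like the sketch below. This is Python rather than the Perl from the original post, and it assumes the charset has already been detected; `html_bytes_to_xml` and the `<document>` root are hypothetical names, only `<original>` comes from the description above.

```python
import xml.etree.ElementTree as ET

def html_bytes_to_xml(raw_html: bytes, charset: str) -> bytes:
    """Wrap HTML bytes of a known charset in a UTF-8 XML document."""
    # Decode from the source charset into text first; a wrong charset
    # raises UnicodeDecodeError here instead of producing broken XML later.
    text = raw_html.decode(charset)
    root = ET.Element("document")            # hypothetical root element
    original = ET.SubElement(root, "original")
    original.text = text                     # markup is escaped on output
    # Serialize everything as UTF-8, with a matching XML declaration.
    return ET.tostring(root, encoding="utf-8", xml_declaration=True)

if __name__ == "__main__":
    latin1_html = "<p>caf\u00e9</p>".encode("iso-8859-1")
    print(html_bytes_to_xml(latin1_html, "iso-8859-1"))
```

    Note that `xml_declaration=True` needs Python 3.8 or later. The HTML ends up as escaped character data, so the XML parser never tries to interpret its tags.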

    --telcontar
Re^2: Exporting HTML in an XML document
by telcontar (Beadle) on Nov 19, 2007 at 10:43 UTC
    If anyone else reads this and has run into a similar problem, I suppose another way of getting around this would be to Base64-encode the data. That'd solve all charset and encoding problems right there. Unfortunately, it also adds 33% space overhead (don't care) and decoding overhead when the file's loaded (more of a problem).

    -- telcontar