roman has asked for the wisdom of the Perl Monks concerning the following question:

Is there a decent way to serialize an XML document (XML::LibXML::Document) with $doc->toString so that characters with diacritics (those outside ISO-8859-1?) are turned into Unicode entities (numeric character references)? If the document comes from parsing XML without an XML declaration, this happens naturally.
use XML::LibXML;
my $str = "<flower>r\x{16f}\x{17e}e</flower>";
my $doc = XML::LibXML->new->parse_string($str);
warn "Document ", $doc->toString, "\n";
warn "Encoding ", $doc->encoding, "\n";
yields
Document <?xml version="1.0"?>
<flower>r&#x16F;&#x17E;e</flower>
Encoding
The only way I have found to achieve this effect on an already parsed document is to set the encoding to ISO-8859-1 (since I cannot "reset" the encoding).
$doc->setEncoding('iso-8859-1')
use XML::LibXML;
my $str = '<?xml version="1.0" encoding="utf8"?>'
    . "<flower>r\x{16f}\x{17e}e</flower>";
my $doc = XML::LibXML->new->parse_string($str);
warn "Document ", $doc->toString, "\n";
warn "Encoding ", $doc->encoding, "\n\n";
$doc->setEncoding('iso-8859-1');
warn "Document ", $doc->toString, "\n";
warn "Encoding ", $doc->encoding, "\n";
yields
Document <?xml version="1.0" encoding="utf8"?>
<flower>růže</flower>
Encoding utf8

Document <?xml version="1.0" encoding="iso-8859-1"?>
<flower>r&#367;&#382;e</flower>
Encoding iso-8859-1
Does this method have any danger or drawback? Is there a better way to "clear" the encoding? I would find it very useful, since the serialized text with entities is immune to any encoding changes when stored in a database (Oracle). Thanks, Roman

Replies are listed 'Best First'.
Re: XML::LibXML document serialized with diacritics as unicode entities
by Joost (Canon) on Sep 29, 2006 at 21:33 UTC
    Does this method have any danger or drawback? ... I would find it very useful, since the serialized text with entities is immune to any encoding changes when stored in a database (Oracle).

    Except iso-8859-1 is NOT immune to encoding changes: characters 128 - 255 have the same code points as in Unicode, but they are not encoded as the same bytes in any Unicode encoding. They also can't be represented in 7-bit ASCII.
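
To make the point about characters 128 - 255 concrete, here is a minimal sketch that uses only the core Encode module. The sample character (U+00E9) and the helper name hex_bytes are illustrative choices, not something from the thread.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: render a byte string as space-separated hex pairs.
sub hex_bytes {
    my ($bytes) = @_;
    return join ' ', map { sprintf '%02X', ord($_) } split //, $bytes;
}

# U+00E9 (e with acute accent) is code point 233 in both ISO-8859-1 and Unicode.
my $char = "\x{e9}";

my $latin1 = encode('ISO-8859-1', $char);   # a single byte
my $utf8   = encode('UTF-8',      $char);   # two bytes

print "ISO-8859-1: ", hex_bytes($latin1), "\n";   # prints "ISO-8859-1: E9"
print "UTF-8:      ", hex_bytes($utf8),   "\n";   # prints "UTF-8:      C3 A9"
```

The code point is identical, but the stored bytes are not, so a Latin-1 byte in this range can still be damaged if something downstream re-reads it as UTF-8.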

    7-bit ASCII might be a bit safer, but as the documentation for setEncoding notes:

    Note that this function has to be used very carefully, since you can’t simply convert one encoding in any other, since some (or even all) characters may not exist in the new encoding. XML::LibXML will not test if the operation is allowed or possible for the given document. The only switching assured to work is to UTF8.
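
One possible way to get pure 7-bit ASCII output without calling setEncoding at all is to serialize to a character string and let the Encode module insert the numeric references. This is only a sketch under assumptions that are not from the thread: it relies on toString of a node (unlike toString of the document) returning a Perl character string, and on Encode's FB_XMLCREF fallback. It also leaves out the XML declaration, because only the root element is serialized.

```perl
use strict;
use warnings;
use Encode qw(encode);
use XML::LibXML;

# Same sample document as in the question.
my $doc = XML::LibXML->new->parse_string("<flower>r\x{16f}\x{17e}e</flower>");

# Assumption: serializing a node yields a character string, not encoded bytes.
my $chars = $doc->documentElement->toString;

# FB_XMLCREF replaces each character that US-ASCII cannot represent
# with a hexadecimal numeric character reference.
my $ascii = encode('US-ASCII', $chars, Encode::FB_XMLCREF());

print "$ascii\n";
```

Because the result contains only ASCII bytes, it should survive storage in a column of any ASCII-compatible character set, at the cost of the extra size mentioned in the reply.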

    Also note that storing full Unicode text as numeric entities is pretty inefficient. If your database and driver support it, using one of the native Unicode encodings is probably better.