davies has asked for the wisdom of the Perl Monks concerning the following question:
XML::LibXML seems to be doing strange things to UTF-8 encoded strings.
use strict;
use warnings;
use Encode qw(encode);
use XML::LibXML;

my $uchar = chr(195) . chr(154);
my $xml = '<?xml version="1.0" encoding="UTF-8"?> <container><node>' . $uchar . '</node></container>';
output($uchar);

my $dom = XML::LibXML->load_xml(string => $xml);
my $node = $dom->findnodes('/container/node')->to_literal;
output($node);

my $encoded = encode('UTF-8', $node);
output($encoded);

sub output {
    my $str = shift;
    print "$str\n";
    for (1..length($str)) {
        print ord(substr($str, $_-1)), ': ';
    }
    print "\n";
}
Some of my output is below. I have removed the lines printing the characters as that would involve more rendering issues.
195: 154:
218:
195: 154:
My real case is reading files, but I am getting the issue demonstrated in this example. The character I have chosen is one that is causing problems (a U with an acute accent), but other characters are being transformed as well.
Given that the XML is flagged as being UTF-8, I cannot see anything in the docs indicating why this transformation should take place. What have I missed, please?
Regards,
John Davies
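The three rows of numbers in the output match what a plain decode/encode round trip produces, with no XML involved. The following is a minimal sketch, assuming (this is not stated in the post) that `load_xml` decodes the UTF-8 document into Perl character strings, so that `to_literal` returns the single character U+00DA rather than its two UTF-8 bytes. The helper `ords` is a hypothetical name introduced here for illustration; `decode` and `encode` are from the core Encode module.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: return the ordinal of each element of a string,
# whether the elements are raw bytes or decoded characters.
sub ords {
    my ($str) = @_;
    return map { ord($_) } split //, $str;
}

my $bytes = chr(195) . chr(154);        # the two UTF-8 bytes from the post
my $chars = decode('UTF-8', $bytes);    # one character, U+00DA
my $again = encode('UTF-8', $chars);    # back to the original two bytes

print join(': ', ords($bytes)), "\n";   # prints 195: 154
print join(': ', ords($chars)), "\n";   # prints 218
print join(': ', ords($again)), "\n";   # prints 195: 154
```

If that assumption holds, 218 is the code point of the character and 195, 154 is its UTF-8 encoding, so the same text is being shown in two forms rather than being altered.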
Re: UTF-8 and XML::LibXML
by choroba (Cardinal) on Nov 26, 2019 at 12:11 UTC
by davies (Monsignor) on Nov 26, 2019 at 12:17 UTC
by choroba (Cardinal) on Nov 26, 2019 at 12:31 UTC
by davies (Monsignor) on Nov 26, 2019 at 12:46 UTC
by choroba (Cardinal) on Nov 26, 2019 at 12:56 UTC
by ikegami (Patriarch) on Nov 26, 2019 at 20:37 UTC
by haj (Vicar) on Nov 26, 2019 at 13:33 UTC
by ikegami (Patriarch) on Nov 26, 2019 at 20:30 UTC