Hi
I'm fairly new to Perl and have run into the following issue. At work we have an XML file that contains accented characters. These characters do not display correctly after the file is parsed and saved back out to a new file using XML::DOM.
I've created a small test XML file (accentTest.xml) to demonstrate what I am seeing:
<?xml version="1.0" encoding="UTF-8"?>
<TEST> é </TEST>
Below is the Perl code that reads this in and saves it back out as accentTestOutPut.xml:
use XML::DOM;

my $parser = new XML::DOM::Parser;
my $doc = $parser->parsefile ("c:\\accentTest.xml");

# Print doc file
$doc->printToFile ("c:\\accentTestOutPut.xml");

# Print to string
print $doc->toString;

# cleanup
$doc->dispose;
I've used TextPad to view the files in binary format.
Before parsing, the e-acute in 'accentTest.xml' is stored as the bytes 'C3 A9', which is correct according to the UTF-8 encoding table at http://www.utf8-chartable.de/. The file is also saved as UTF-8 (according to Notepad).
After saving with $doc->printToFile ("c:\\accentTestOutPut.xml") and viewing the result in TextPad, the e-acute is stored as the single byte 'E9', which does not seem to be valid UTF-8. The file itself is saved as ANSI (according to Notepad, anyway). If I view this file in PSPad I can see the e-acute, whereas in Notepad++ I cannot. I am far from an expert, but it seems to have something to do with encoding.
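To make the two hex values concrete, here is a minimal sketch that encodes the single character U+00E9 (e-acute) with the core Encode module and prints the resulting bytes. The helper name hex_bytes is made up for illustration; nothing here touches XML::DOM.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: render a byte string as space-separated hex pairs.
sub hex_bytes {
    my ($bytes) = @_;
    return join ' ', map { sprintf '%02X', $_ } unpack 'C*', $bytes;
}

# U+00E9 is LATIN SMALL LETTER E WITH ACUTE.
my $char = "\x{E9}";

my $utf8_hex   = hex_bytes(encode('UTF-8', $char));
my $latin1_hex = hex_bytes(encode('ISO-8859-1', $char));

print "UTF-8:      $utf8_hex\n";    # prints "UTF-8:      C3 A9"
print "ISO-8859-1: $latin1_hex\n";  # prints "ISO-8859-1: E9"
```

So 'E9' is what the same character looks like in ISO-8859-1 (Latin-1), which would be consistent with Notepad reporting the output file as ANSI.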
If I manually resave "c:\\accentTestOutPut.xml" as UTF-8 (using Notepad), I can see my e-acute again in both PSPad and Notepad++.
Does anyone have any idea what is going on? I hope I've explained the issue clearly.
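The difference between the two saved files can be reproduced without XML::DOM. The sketch below writes the same one-character string to two temporary files, once through a handle with no encoding layer and once through a handle opened with :encoding(UTF-8), then dumps the bytes that ended up on disk. The file names and the helper write_and_dump are hypothetical. Whether XML::DOM's printToFile writes through a handle without an encoding layer is an assumption here, not something confirmed from its source.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

my $dir  = tempdir(CLEANUP => 1);
my $text = "\x{E9}";    # a Perl character string holding e-acute

# Hypothetical helper: write $text using the given open mode,
# then read the file back as raw bytes and return them as hex.
sub write_and_dump {
    my ($path, $mode) = @_;

    open my $out, $mode, $path or die "Cannot write $path: $!";
    print {$out} $text;
    close $out or die "Cannot close $path: $!";

    open my $in, '<:raw', $path or die "Cannot read $path: $!";
    my $bytes = do { local $/; <$in> };
    close $in;

    return join ' ', map { sprintf '%02X', $_ } unpack 'C*', $bytes;
}

my $plain = write_and_dump("$dir/plain.xml", '>:raw');
my $utf8  = write_and_dump("$dir/utf8.xml",  '>:encoding(UTF-8)');

print "no encoding layer: $plain\n";   # prints "no encoding layer: E9"
print "UTF-8 layer:       $utf8\n";    # prints "UTF-8 layer:       C3 A9"
```

If the assumption holds, printing $doc->toString to a handle opened with '>:encoding(UTF-8)' might be worth trying in place of printToFile, though that is untested against the real file.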
Using XML::LibXML I do not experience the same issue, but I have been asked not to use it if possible.
In reply to XML:: DOM and Accented Characters by freeflyer