freeflyer has asked for the wisdom of the Perl Monks concerning the following question:
Hi
I'm fairly new to Perl and have come up against the following issue. We have (at work) an XML file that contains accented characters. These characters do not display correctly after the file is parsed and saved back out to a new file using XML::DOM.
I've created a small test XML (accentTest.xml) to demonstrate what I am seeing:
<?xml version="1.0" encoding="UTF-8"?>
<TEST> é </TEST>
And below is the Perl code that reads this in and saves it back out as accentTestOutPut.xml.
use strict;
use warnings;
use XML::DOM;

# Parse the test file into a DOM tree
my $parser = XML::DOM::Parser->new;
my $doc = $parser->parsefile ("c:\\accentTest.xml");

# Print doc to file
$doc->printToFile ("c:\\accentTestOutPut.xml");

# Print to string
print $doc->toString;

# Cleanup (releases the document's memory)
$doc->dispose;
I've used TextPad to view the files in binary format.
Before parsing, the e-acute in 'accentTest.xml' is stored as the bytes 'C3 A9', which is correct according to the UTF-8 encoding table at http://www.utf8-chartable.de/. The file is also saved as UTF-8 (according to Notepad).
After the file is saved with $doc->printToFile ("c:\\accentTestOutPut.xml"), TextPad shows the e-acute as the single byte 'E9', which does not seem to be valid UTF-8, and the file itself is saved as ANSI (according to Notepad, anyway). If I view this file in PSPad I can see the e-acute, whereas in Notepad++ I cannot. I am far from an expert, but it seems to have something to do with encoding?
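The 'E9' versus 'C3 A9' difference can be reproduced without XML::DOM. The sketch below is only an illustration. It assumes (nothing above confirms this) that the output file is written through a filehandle with no encoding layer. The helper name `bytes_written` is made up for this example. It prints the single character U+00E9 to an in-memory filehandle twice, once with no layer and once with `:encoding(UTF-8)`, and dumps the bytes that end up in the buffer.

```perl
use strict;
use warnings;

# Hypothetical helper: print $text through a handle opened with $mode
# and return the bytes that were written, as an upper-case hex string.
sub bytes_written {
    my ($mode, $text) = @_;
    my $buffer = '';
    open my $fh, $mode, \$buffer
        or die "Cannot open in-memory handle: $!";
    print {$fh} $text;
    close $fh or die "Cannot close in-memory handle: $!";
    return uc unpack('H*', $buffer);
}

my $e_acute = "\x{E9}";    # one character, U+00E9 (e-acute)

print "no layer:         ", bytes_written('>', $e_acute), "\n";
print ":encoding(UTF-8): ", bytes_written('>:encoding(UTF-8)', $e_acute), "\n";
```

With no layer, Perl writes a character below U+0100 as one byte, which matches the 'E9' seen in TextPad. With the UTF-8 layer the same character is written as 'C3 A9'.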
If I manually re-save "c:\\accentTestOutPut.xml" as UTF-8 (using Notepad), I can see my e-acute again in both PSPad and Notepad++.
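The manual re-save in Notepad can be approximated in Perl with the core Encode module. This is a sketch, assuming the 'ANSI' output is ISO-8859-1. On a Western European Windows system Notepad's 'ANSI' is usually Windows-1252, which agrees with ISO-8859-1 for e-acute. `latin1_to_utf8` is a name invented for the example.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: interpret ISO-8859-1 bytes as characters,
# then serialise those characters as UTF-8 bytes.
sub latin1_to_utf8 {
    my ($bytes) = @_;
    my $chars = decode('ISO-8859-1', $bytes);
    return encode('UTF-8', $chars);
}

my $ansi_bytes = "\xE9";    # e-acute as found in the output file
my $utf8_bytes = latin1_to_utf8($ansi_bytes);

printf "before: %s\n", uc unpack('H*', $ansi_bytes);
printf "after:  %s\n", uc unpack('H*', $utf8_bytes);
```

This only repairs the file after the fact. It does not explain why the bytes changed in the first place.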
Does anyone have any idea what is going on? Hopefully I've explained the issue clearly.
Using XML::LibXML I do not experience the same issue, but I have been asked not to use it if possible.
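For readers who see the same symptom and need to stay with XML::DOM, one possible workaround is to open the output file yourself with an explicit UTF-8 layer and hand that filehandle to the document's `printToFileHandle` method, which XML::DOM documents alongside `printToFile`. The sketch below has not been tested against the setup described above. `save_as_utf8` is a hypothetical helper, not part of XML::DOM, and it assumes the parsed text reaches Perl as character strings.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of XML::DOM): serialise a document through
# a filehandle that carries an explicit UTF-8 encoding layer.
# $doc is expected to provide printToFileHandle, as XML::DOM nodes do.
sub save_as_utf8 {
    my ($doc, $path) = @_;
    open my $fh, '>:encoding(UTF-8)', $path
        or die "Cannot open '$path' for writing: $!";
    $doc->printToFileHandle($fh);
    close $fh or die "Cannot close '$path': $!";
    return;
}

# Intended use with the script from the question (requires XML::DOM):
#   my $doc = XML::DOM::Parser->new->parsefile("c:\\accentTest.xml");
#   save_as_utf8($doc, "c:\\accentTestOutPut.xml");
#   $doc->dispose;
```

The helper keeps the encoding decision in one visible place rather than relying on whatever default the library uses when it opens the file.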
Re: XML::DOM and Accented Characters
by almut (Canon) on Aug 06, 2010 at 10:27 UTC
by freeflyer (Novice) on Aug 06, 2010 at 14:12 UTC
by almut (Canon) on Aug 06, 2010 at 14:36 UTC
by graff (Chancellor) on Aug 06, 2010 at 17:03 UTC
by freeflyer (Novice) on Aug 07, 2010 at 10:09 UTC
by Pickwick (Beadle) on Aug 07, 2010 at 15:26 UTC
by Anonymous Monk on Aug 07, 2010 at 11:41 UTC
Re: XML::DOM and Accented Characters
by ikegami (Patriarch) on Aug 07, 2010 at 16:33 UTC
by freeflyer (Novice) on Aug 09, 2010 at 09:53 UTC
by ikegami (Patriarch) on Aug 09, 2010 at 13:51 UTC
by freeflyer (Novice) on Aug 09, 2010 at 15:54 UTC