Hi

I'm fairly new to Perl and have run into the following issue. At work we have an XML file that contains accented characters. These characters are not displayed correctly after the file is parsed and saved back out to a new file using XML::DOM.

I've created a small test XML (accentTest.xml) to demonstrate what I am seeing:

<?xml version="1.0" encoding="UTF-8"?>
<TEST> é </TEST>

Below is the Perl code that reads this in and saves it back out as accentTestOutPut.xml.

use XML::DOM;

my $parser = new XML::DOM::Parser;
my $doc = $parser->parsefile ("c:\\accentTest.xml");

# Print doc file
$doc->printToFile ("c:\\accentTestOutPut.xml");

# Print to string
print $doc->toString;

# cleanup
$doc->dispose;

I've used TextPad to view the files in binary format.

Before parsing, the e-acute in 'accentTest.xml' is stored as the bytes 'C3 A9', which is correct according to the UTF-8 encoding table at http://www.utf8-chartable.de/. The file is also saved as UTF-8 (according to Notepad).

After the file has been saved with $doc->printToFile ("c:\\accentTestOutPut.xml"), TextPad shows the e-acute as the single byte 'E9', which does not seem to be a valid UTF-8 sequence. The file itself is saved as ANSI (according to Notepad, anyway). If I view this file in PSPad I can see the e-acute, whereas in Notepad++ I cannot. I am far from an expert, but it seems to have something to do with encoding?

If I manually resave "c:\\accentTestOutPut.xml" as UTF-8 (using Notepad), I can see my e-acute again in both PSPad and Notepad++.
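
For anyone who wants to reproduce the two byte patterns without a hex editor, here is a minimal sketch using only the core Encode module. The helper name hex_bytes is made up for illustration. Treating the 'E9' output as ISO-8859-1 (Latin-1) is my assumption, based on E9 being that encoding's single-byte value for e-acute. Nothing here touches XML::DOM itself.

```perl
use strict;
use warnings;
use Encode qw(encode);

# The character under test: U+00E9 (LATIN SMALL LETTER E WITH ACUTE).
my $char = "\x{E9}";

# Hypothetical helper: encode $text in $encoding and return the
# resulting bytes as space-separated upper-case hex.
sub hex_bytes {
    my ($encoding, $text) = @_;
    my $bytes = encode($encoding, $text);
    return join ' ', map { sprintf '%02X', ord } split //, $bytes;
}

# Two bytes, matching what TextPad shows for accentTest.xml.
print 'UTF-8:      ', hex_bytes('UTF-8', $char), "\n";

# One byte, matching what TextPad shows for accentTestOutPut.xml
# (assuming that file is Latin-1 encoded).
print 'ISO-8859-1: ', hex_bytes('ISO-8859-1', $char), "\n";
```

If that assumption holds, the character survives the round trip and only its byte encoding changes, which would fit the fact that resaving the output as UTF-8 makes it display correctly again.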

Does anyone have any idea what is going on? Hopefully I've explained the issue clearly.

Using XML::LibXML I do not experience the same issue, but I have been asked not to use it if possible.


In reply to XML:: DOM and Accented Characters by freeflyer
