in reply to XML:: DOM and Accented Characters

On Windows,
use strict;
use warnings;
use XML::DOM;

my $xml = <<"__EOI__";
<?xml version="1.0" encoding="UTF-8"?>
<TEST> \xC3\xA9 </TEST>
__EOI__

my $parser = new XML::DOM::Parser;
my $doc = $parser->parse($xml);
$doc->printToFile("test.xml");

>perl a.pl

>perl -e"$/=\16; while (<>) { my $s=uc unpack 'H*', $_; $s=~s/..\K/ /g; print qq{$s\n}; }" test.xml
3C 3F 78 6D 6C 20 76 65 72 73 69 6F 6E 3D 22 31
2E 30 22 20 65 6E 63 6F 64 69 6E 67 3D 22 55 54
46 2D 38 22 3F 3E 0A 3C 54 45 53 54 3E 20 E9 20
3C 2F 54 45 53 54 3E 0A
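A side note on why a lone E9 shows up in that dump: once the two input bytes C3 A9 have been decoded into the single character U+00E9, printing that character to a handle with no encoding layer writes it as one byte. The sketch below mimics this with the core Encode module and an in-memory filehandle. It does not involve XML::DOM at all, so treat it as an illustration of the Perl I/O behaviour rather than of the module's internals; hex_bytes is a helper name made up for the example.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# The two bytes from the heredoc, as they are handed to the parser.
my $bytes = "\xC3\xA9";

# Decoding turns them into a single character, U+00E9.
my $char = decode('UTF-8', $bytes);

# Hypothetical helper: space-separated upper-case hex of a byte string.
sub hex_bytes {
    my ($s) = @_;
    return join ' ', map { sprintf '%02X', ord($_) } split //, $s;
}

# Print the character to an in-memory handle that has no encoding layer.
my $raw = '';
open my $out, '>', \$raw or die $!;
print {$out} $char;
close $out;

print 'no layer: ', hex_bytes($raw), "\n";                       # E9
print 'encoded:  ', hex_bytes(encode('UTF-8', $char)), "\n";     # C3 A9
```

The second line of output is what a handle opened with an encoding layer would write for the same character.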

As previously shown, XML::DOM doesn't encode the output for you (though it should): the é is written as the single byte E9 rather than as C3 A9. So let's try with the previously mentioned fix:

use strict;
use warnings;
use XML::DOM;

my $xml = <<"__EOI__";
<?xml version="1.0" encoding="UTF-8"?>
<TEST> \xC3\xA9 </TEST>
__EOI__

my $parser = new XML::DOM::Parser;
my $doc = $parser->parse($xml);
open my $fh, ">:utf8", "test.xml" or die $!;
$doc->printToFileHandle($fh);

>perl a.pl

>perl -e"$/=\16; while (<>) { my $s=uc unpack 'H*', $_; $s=~s/..\K/ /g; print qq{$s\n}; }" test.xml
3C 3F 78 6D 6C 20 76 65 72 73 69 6F 6E 3D 22 31
2E 30 22 20 65 6E 63 6F 64 69 6E 67 3D 22 55 54
46 2D 38 22 3F 3E 0A 3C 54 45 53 54 3E 20 C3 A9
20 3C 2F 54 45 53 54 3E 0A

Perl did its thing correctly, so you have a problem with your editor. There are some solutions:

You might want to check (using the above command) to make sure your input contains what you think it contains.
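If the one-liner is awkward to quote in your shell, the same check can be written as a small script. This is only a sketch: hex_dump is a name invented here, and unlike the one-liner it opens the file with the :raw layer, so on Windows any 0D bytes from CRLF line endings will show up in the dump instead of being translated away.

```perl
use strict;
use warnings;

# Format a byte string as rows of 16 space-separated hex pairs.
sub hex_dump {
    my ($data) = @_;
    my @rows;
    for (my $i = 0; $i < length($data); $i += 16) {
        my $chunk = substr($data, $i, 16);
        push @rows, join ' ', map { sprintf '%02X', ord($_) } split //, $chunk;
    }
    return @rows;
}

# Only read a file when one is named on the command line.
if (@ARGV) {
    my $file = shift @ARGV;
    open my $fh, '<:raw', $file or die "Can't open $file: $!";
    my $data = do { local $/; <$fh> };   # slurp the whole file as bytes
    close $fh;
    print "$_\n" for hex_dump($data);
}
```

Saved as, say, hexdump.pl (a made-up file name), it would be run as perl hexdump.pl test.xml, once on the input file and once on the output file.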

Re^2: XML:: DOM and Accented Characters
by freeflyer (Novice) on Aug 09, 2010 at 09:53 UTC

    Hi ikegami

    I've tried selecting UTF-8 in the editor menus and inserting a BOM, but neither has seemingly worked. I think my files are coming out windows-1252 encoded, because without running the code you provided, just changing the first line to

    <?xml version="1.0" encoding="windows-1252"?>

    results in me being able to open the file OK.

    What is confusing me is that running the code below

    #!/bin/perl -w
    use XML::DOM;
    use PerlIO::encoding;

    my $parser = new XML::DOM::Parser;
    my $doc = $parser->parsefile ("c:\\accentTest.xml", ProtocolEncoding => 'UTF-8');

    # Print doc file
    $doc->printToFile ("c:\\accentTestOutPut.xml");

    open my $fh, ">:encoding(UTF-8)", "accentTestOutPut.xml" or die $!;
    $doc->print($fh);
    $doc->dispose;

    on Windows results in a file that (to my untrained eye) does not appear to be UTF-8 encoded (in hex I do not see the C3 A9 for the e-acute) and will not open without the first-line change mentioned above. However,

    if I run the same code on Unix and open the resulting file on Windows, it's all fine. It appears properly UTF-8 encoded, and viewing the file in hex shows the C3 A9 expected for the e-acute.

    At the moment I don't understand why the problem is with the editors (not saying it isn't, I just don't understand why yet). What's confusing me is that the file created using the same code on Unix opens without issue.

      I think my files are coming out windows-1252 encoded

      You think? I gave you a tool to check. I also asked that you check your input file.

        Badly worded; I've got it sorted now. Thanks for your time, but the problem was down to me not noticing something incredibly obvious. I hadn't noticed that the file path used when opening the filehandle in the code I gave had the C:\\ knocked off, so the resulting file was being saved elsewhere. I only noticed after almut got me to take out the printToFile method.

        Incredibly stupid, but I've learnt a lot more by leading myself down the garden path.
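For anyone who lands here with the same symptom: a path with no drive or directory is resolved against the current working directory, so the file can end up somewhere other than where you are looking. One way to catch this early is to print the absolute path before writing. A minimal sketch using the core File::Spec module; resolved_path is a helper name made up for this example.

```perl
use strict;
use warnings;
use File::Spec;

# Hypothetical helper: report where a (possibly relative) path really points.
sub resolved_path {
    my ($path) = @_;
    # rel2abs resolves against the current working directory by default.
    return File::Spec->rel2abs($path);
}

# No drive or directory given, as in the open() call in the script above.
my $out = 'accentTestOutPut.xml';
print "Writing to: ", resolved_path($out), "\n";
```

An already absolute path comes back unchanged apart from cleanup, so the check is harmless to leave in while debugging.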