in reply to Re^5: XML:: DOM and Accented Characters
in thread XML:: DOM and Accented Characters

Pickwick, here's the code prior to trying any of the suggestions made. It's a simple test XML with nothing but a couple of spaces and an e-acute.

I've created a small test XML (accentTest.xml) to demonstrate what I am seeing:

<?xml version="1.0" encoding="UTF-8"?>
<TEST> é </TEST>

And below is the Perl code that reads this in and saves it back out as accentTestOutPut.xml.

use XML::DOM;

my $parser = new XML::DOM::Parser;
my $doc = $parser->parsefile ("c:\\accentTest.xml");

# Print doc file
$doc->printToFile ("c:\\accentTestOutPut.xml");

# Print to string
print $doc->toString;

# cleanup
$doc->dispose;

Re^7: XML:: DOM and Accented Characters
by Pickwick (Beadle) on Aug 08, 2010 at 12:59 UTC
    Pickwick, here's the code prior to trying any of the suggestions made.

    We don't need the code from before the suggestions, because we already know why that code can't work as expected. We need the code where you override the automatic encoding of the Perl I/O layer with >:utf8, because that code really should work.
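The effect of the >:utf8 layer can be seen without XML::DOM at all. The following is only a minimal sketch, assuming the parsed document hands Perl the character U+00E9; bytes_written is a hypothetical helper, and an in-memory filehandle stands in for the real output file.

```perl
use strict;
use warnings;

# Write a string to an in-memory filehandle opened with the given mode
# and return the raw bytes that ended up in the "file".
# (bytes_written is a hypothetical helper, not part of XML::DOM.)
sub bytes_written {
    my ($mode, $text) = @_;
    my $buffer = '';
    open my $fh, $mode, \$buffer or die "Cannot open in-memory handle: $!";
    print {$fh} $text;
    close $fh or die "Cannot close in-memory handle: $!";
    return $buffer;
}

my $e_acute = "\x{E9}";    # the single character U+00E9 (e-acute)

my $plain = bytes_written('>',      $e_acute);    # no encoding layer
my $utf8  = bytes_written('>:utf8', $e_acute);    # explicit UTF-8 layer

# Show each result as hex bytes.
printf "no layer: %s\n", join ' ', map { sprintf '%02X', ord } split //, $plain;
printf ":utf8   : %s\n", join ' ', map { sprintf '%02X', ord } split //, $utf8;
```

Without a layer the character goes out as the single byte E9; with >:utf8 it goes out as the two bytes C3 A9, which is what a file declaring encoding="UTF-8" has to contain.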

    Give us your latest code; there sure is an error somewhere.

      Hi, I've got 5 versions of code incorporating various suggestions made to me, none of which I can (yet) get to work on Windows. The last version I have tested on a Unix machine and it worked OK. Trying to open this Unix-created XML on Windows results in it opening OK.

      #!/bin/perl -w
      use XML::DOM;

      my $parser = new XML::DOM::Parser;
      my $doc = $parser->parsefile ("c:\\accentTest.xml", ProtocolEncoding => 'UTF-8');

      # Print doc file
      $doc->printToFile ("c:\\accentTestOutPut.xml");

      #re-open file in UTF-8 encoded filehandle
      open my $fh, ">:utf8", "accentTestOutPut.xml" or die $!;
      $doc->print($fh);

      # cleanup
      $doc->dispose;

      #!/bin/perl -w
      use XML::DOM;

      my $parser = new XML::DOM::Parser;
      my $doc = $parser->parsefile ("c:\\accentTest.xml", ProtocolEncoding => 'UTF-8');

      # Print doc file
      $doc->printToFile ("c:\\accentTestOutPut.xml");

      #re-open file in UTF-8 encoded filehandle
      open my $fh, ">:utf8", "accentTestOutPut.xml" or die $!;
      print $fh "\x{FEFF}"; # BOM
      $doc->print($fh);

      # cleanup
      $doc->dispose;

      #!/bin/perl -w
      use XML::DOM;
      use UTF8BOM;

      my $parser = new XML::DOM::Parser;
      my $doc = $parser->parsefile ("c:\\accentTest.xml", ProtocolEncoding => 'UTF-8');

      # Print doc file
      $doc->printToFile ("c:\\accentTestOutPut.xml");
      UTF8BOM->insert_into_file('c:\\accentTestOutPut.xml');

      # cleanup
      $doc->dispose;

      #!/bin/perl -w
      use XML::DOM;
      use Encode qw(encode_utf8);

      my $parser = new XML::DOM::Parser;
      my $doc = $parser->parsefile ("c:\\accentTest.xml", ProtocolEncoding => 'UTF-8');

      # Print doc file
      $doc->printToFile ("c:\\accentTestOutPut.xml");

      open my $fh, ">:utf8", "accentTestOutPut.xml" or die $!;
      encode_utf8($fh);
      $doc->print($fh);

      # cleanup
      $doc->dispose;

      #!/bin/perl -w
      use XML::DOM;
      use PerlIO::encoding;

      my $parser = new XML::DOM::Parser;
      my $doc = $parser->parsefile ("c:\\accentTest.xml", ProtocolEncoding => 'UTF-8');

      # Print doc file
      $doc->printToFile ("c:\\accentTestOutPut.xml");

      open my $fh, ">:encoding(UTF-8)", "accentTestOutPut.xml" or die $!;
      $doc->print($fh);

      # cleanup
      $doc->dispose;

      What I have also discovered is that changing the first line of the XML to <?xml version="1.0" encoding="windows-1252"?> (as suggested by ikegami) results in me being able to open the file OK on Windows in all cases.
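One plausible reading of that observation, sketched under the assumption that the output file holds the e-acute as the single byte 0xE9 (the thread does not show the actual bytes): that byte is a complete character in Windows-1252 but an incomplete sequence in UTF-8. The check below uses only the core Encode module.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Assumption: the file on disk contains e-acute as the lone byte 0xE9.
my $byte = "\xE9";

# Read as Windows-1252, 0xE9 is a complete character: U+00E9.
my $as_cp1252 = decode('cp1252', $byte);

# Read as UTF-8, a lone 0xE9 is only the lead byte of a three-byte
# sequence, so a strict decoder rejects it. FB_CROAK makes decode() die
# on malformed input instead of substituting a replacement character.
my $copy    = $byte;    # decode() may modify its input when a check value is given
my $utf8_ok = eval { decode('UTF-8', $copy, Encode::FB_CROAK); 1 } ? 1 : 0;

printf "cp1252 decode gives U+%04X\n", ord $as_cp1252;
print  "strict UTF-8 decode ", ($utf8_ok ? "succeeded" : "failed"), "\n";
```

If that assumption holds, the windows-1252 declaration simply makes the XML header agree with the bytes already in the file, which would explain why the file then opens.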

        The idea was to not call ->printToFile (which you're doing in all five cases), but to use the suggested code instead:

        #!/usr/bin/perl -w
        use XML::DOM;

        my $parser = new XML::DOM::Parser;
        my $doc = $parser->parsefile ("c:\\accentTest.xml");

        open my $fh, ">:utf8", "c:\\accentTestOutPut.xml" or die $!;
        print $fh "\x{FEFF}"; # BOM
        $doc->print($fh);

        $doc->dispose;
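To see what bytes this approach should put at the start of the file, here is a small sketch that needs no XML::DOM. It assumes an in-memory filehandle in place of c:\\accentTestOutPut.xml and a hand-written string in place of $doc->print($fh); both are stand-ins for illustration only.

```perl
use strict;
use warnings;

my $buffer = '';
open my $fh, '>:utf8', \$buffer or die "Cannot open in-memory handle: $!";

print {$fh} "\x{FEFF}";                   # BOM, as in the suggested code
print {$fh} "<TEST> \x{E9} </TEST>\n";    # stand-in for $doc->print($fh)

close $fh or die "Cannot close in-memory handle: $!";

# Dump the written bytes as hex so the encoding is visible.
my $hex = join ' ', map { sprintf '%02X', ord } split //, $buffer;
print "$hex\n";
```

The dump should start with EF BB BF (the UTF-8 form of the BOM) and contain C3 A9 for the e-acute. Running the same kind of hex dump on the real output file is a quick way to tell whether the >:utf8 handle or ->printToFile wrote it.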