clintonm9 has asked for the wisdom of the Perl Monks concerning the following question:

I am having some Unicode issues with XML::Simple. I have written a test program to show my problem:

    #!/usr/bin/perl
    use strict;
    use utf8;
    use XML::Simple;

    # A simple test to show the UTF8 problem
    my $parameters;
    push(@{ $parameters->{Request} }, {
        URI        => '/HRM/EmploymentManager/AvailableOpenings',
        Action     => 'GET',
        ID         => '123',
        Parameters => {
            Status => 'Test',
        },
    });

    # convert Perl hash ref into XML
    my $xs = XML::Simple->new();
    my $x  = $xs->XMLout($parameters, KeepRoot => 0, RootName => 'Requests');
    print $x;

    # convert XML into Perl hash ref
    $xs = XML::Simple->new();
    my $XML = $xs->XMLin($x, ForceArray => 0);

    # Look at the perl hash ref, there shouldn't be any
    my $temp = $XML->{'Request'}->{'Action'};
    my $flag = utf8::is_utf8($temp);
    print "$flag ! $temp\n\n\n";
    exit;

I am trying to use iso-8859-1 and not use UTF8. Any ideas why the UTF8 flag is on, and how to make XML::Simple not produce UTF8 when iso-8859-1 is sent in the headers?

Replies are listed 'Best First'.
Re: UTF8 and XML
by ikegami (Patriarch) on Mar 08, 2010 at 23:43 UTC

    The UTF8 flag indicates which internal storage format is used for the string. It's usually on when the data has been decoded (from UTF-8, iso-8859-1 or whatever), which means it's usually on when the string contains text. Since you're examining the parsed/decoded XML, it's not surprising it's on. After all, the purpose of XML is to store text, and the purpose of parsing an XML document is to extract the text data within it.
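As a minimal sketch of the point about internal storage (not part of the original reply, and assuming only the core `utf8::` functions), the snippet below stores the same one-character string in both internal formats. The flag differs, but the strings compare equal and have the same length, so the flag says nothing about which encoding the text "is in".

```perl
use strict;
use warnings;

# Two copies of the same one-character string, U+00C9.
my $downgraded = "\x{C9}";
my $upgraded   = "\x{C9}";

utf8::downgrade($downgraded);   # force the single-byte internal format
utf8::upgrade($upgraded);       # force the UTF-8 internal format

# The flag reports the storage format only.
printf "downgraded flag: %d\n", utf8::is_utf8($downgraded) ? 1 : 0;   # 0
printf "upgraded flag:   %d\n", utf8::is_utf8($upgraded)   ? 1 : 0;   # 1

# As far as Perl code is concerned, the two strings are identical.
printf "equal:           %d\n", $downgraded eq $upgraded ? 1 : 0;     # 1
printf "lengths:         %d %d\n", length($downgraded), length($upgraded);   # 1 1
```

Testing the flag is therefore rarely useful for deciding whether data needs encoding; what matters is whether the string holds decoded text or encoded bytes.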

    I am trying to use iso-8859-1 and not use UTF8.

    You're using neither iso-8859-1 nor UTF-8; you are using Unicode characters. You're free to encode those characters using any encoding you wish (e.g. UTF-8, iso-8859-1, etc.) when it's appropriate (e.g. on output).
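To illustrate encoding on output, here is a small sketch using the core `Encode` module (the variable names are made up for the example). The same decoded text is turned into two different byte strings, one per encoding.

```perl
use strict;
use warnings;
use Encode qw( encode );

# Decoded text: four characters, the first being U+00C9.
my $text = "\x{C9}ric";

# Encoding happens at the boundary, when the text leaves the program.
my $latin1_bytes = encode('iso-8859-1', $text);   # one byte per character here
my $utf8_bytes   = encode('UTF-8',      $text);   # U+00C9 becomes two bytes

printf "iso-8859-1: %d bytes\n", length($latin1_bytes);   # 4
printf "UTF-8:      %d bytes\n", length($utf8_bytes);     # 5
```

The text itself is unchanged by either call; only the bytes produced for output differ.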

Re: UTF8 and XML
by ikegami (Patriarch) on Mar 09, 2010 at 00:06 UTC

    Keeping in mind that \311 is the iso-8859-1 encoding of U+00C9, and that \303\211 is the UTF-8 encoding of the same character, you can see that XML::Simple properly extracts text from XML:

        #!/usr/bin/perl
        use strict;
        use warnings;

        use Data::Dumper qw( Dumper );
        use XML::Simple  qw( );

        $XML::Simple::PREFERRED_PARSER = 'XML::Parser';

        my $latin1_xml = <<"__EOI__";
        <?xml version="1.0" encoding="iso-8859-1"?>
        <root>\311ric</root>
        __EOI__

        my $utf8_xml = <<"__EOI__";
        <?xml version="1.0" encoding="UTF-8"?>
        <root>\303\211ric</root>
        __EOI__

        my $xs = XML::Simple->new();
        for my $xml ($latin1_xml, $utf8_xml) {
            my $tree = $xs->XMLin($xml,
                ForceArray => 1,
                KeepRoot   => 1,
            );

            local $Data::Dumper::Useqq = 1;
            print Dumper $tree;
        }

        $VAR1 = {
                  'root' => [
                              "\x{c9}ric"
                            ]
                };
        $VAR1 = {
                  'root' => [
                              "\x{c9}ric"
                            ]
                };

    It also outputs XML properly (albeit using a weird interface):

        #!/usr/bin/perl
        use strict;
        use warnings;

        use Data::Dumper qw( Dumper );
        use XML::Simple  qw( );

        my $tree = {
            'root' => [ "\x{c9}ric" ]
        };

        $XML::Simple::PREFERRED_PARSER = 'XML::Parser';

        my $xs = XML::Simple->new();
        for my $enc (qw( iso-8859-1 UTF-8 )) {
            my $xml = '';
            {
                open(my $fh, ">:encoding($enc)", \$xml)
                    or die;

                $xs->XMLout($tree,
                    XMLDecl    => qq{<?xml version="1.0" encoding="$enc"?>},
                    KeepRoot   => 1,
                    OutputFile => $fh,
                );

                close($fh);
            }

            local $Data::Dumper::Useqq = 1;
            print Dumper $xml;
        }

        $VAR1 = "<?xml version=\"1.0\" encoding=\"iso-8859-1\"?>\n<root>\311ric</root>\n";
        $VAR1 = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n<root>\303\211ric</root>\n";

      So if I wanted to send ISO-8859-1 XML, I would just add that to the header? Doing this didn't make a difference. Please see:

          #!/usr/bin/perl
          use strict;
          use utf8;
          use XML::Simple;

          # A simple test to show the UTF8 problem
          my $parameters;
          push(@{ $parameters->{Request} }, {
              URI        => '/HRM/EmploymentManager/AvailableOpenings',
              Action     => 'GET',
              ID         => '123',
              Parameters => {
                  Status => 'Test',
              },
          });

          # convert Perl hash ref into XML
          my $xs = XML::Simple->new();
          my $x  = $xs->XMLout($parameters,
              KeepRoot => 0,
              RootName => 'Requests',
              XMLDecl  => qq{<?xml version="1.0" encoding="iso-8859-1"?>},
          );
          print $x;

          # convert XML into Perl hash ref
          $xs = XML::Simple->new();
          my $XML = $xs->XMLin($x, ForceArray => 0);

          # Look at the perl hash ref, there shouldn't be any
          my $temp = $XML->{'Request'}->{'Action'};
          my $flag = utf8::is_utf8($temp);
          print "$flag ! $temp\n\n\n";
          exit;

        No, you removed the encoding. While XML::Simple properly decodes when parsing XML, it doesn't encode when generating XML. (That's a bug. I called it a "weird interface" earlier.)
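One way to read this reply: the `XMLDecl` option only writes the declaration, so the caller still has to encode the generated text. The sketch below is an alternative to the `:encoding` file handle shown earlier, not something posted in the thread; it assumes that `XMLout` returns the document as a string of characters when no `OutputFile` is given, and encodes that string with the core `Encode` module.

```perl
use strict;
use warnings;
use Encode      qw( encode );
use XML::Simple qw( );

my $enc  = 'iso-8859-1';
my $tree = { root => [ "\x{C9}ric" ] };

my $xs = XML::Simple->new();

# XMLout produces text; the declaration merely states the intended encoding.
my $xml = $xs->XMLout($tree,
    KeepRoot => 1,
    XMLDecl  => qq{<?xml version="1.0" encoding="$enc"?>},
);

# Assumption: encoding the returned string ourselves yields bytes that
# match the declaration.
my $bytes = encode($enc, $xml);

binmode(STDOUT);   # write the bytes as-is, with no further layers
print $bytes;
```

The declaration and the call to `encode` share the `$enc` variable so that the two cannot drift apart.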