gizzlon has asked for the wisdom of the Perl Monks concerning the following question:

Hello, I'm using XML::Simple to read a Latin-1 (ISO-8859-1) XML file, and everything works until it encounters numeric character references such as &#8211;. Strings containing those references seem to be double encoded, or something similarly odd is happening:
<?xml version="1.0" encoding="iso-8859-1" standalone="yes"?>
<foo>
  <p>foo</p>
  <p>foo with an &#8211;</p>
  <p>some latin1 encoded chars: æøå ÆØÅ</p>
  <p>same, but this time whith an &#8211; .. æøå ÆØÅ</p>
  <p>same, but this thime with an &#8221; instead .. : æøå ÆØÅ</p>
</foo>
Read by this script:
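#!/usr/bin/perl
use strict;
use warnings;
use XML::Simple;
use Data::Dumper;

# Parse the file named on the command line
my $xmldata = XMLin( $ARGV[0], ForceArray=>1, KeyAttr=>{meta=>"name"}, SuppressEmpty=>"")
    or die "Could not parse xml data: $!";

foreach my $f ( @{$xmldata->{'p'} } ) {
    print $f;
    print "\n";
}

print Dumper($xmldata);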
Produces:
./test2.pl foo.xml
foo
Wide character in print at ./test2.pl line 12.
foo with an –
some latin1 encoded chars: æøå ÆØÅ
Wide character in print at ./test2.pl line 12.
same, but this time whith an – .. æøå Ã&#134;Ã&#152;Ã&#133;
Wide character in print at ./test2.pl line 12.
same, but this thime with an ” instead .. : æøå Ã&#134;Ã&#152;Ã&#133;
$VAR1 = {
          'p' => [
                   'foo',
                   "foo with an \x{2013}",
                   'some latin1 encoded chars: æøå ÆØÅ',
                   "same, but this time whith an \x{2013} .. \x{c3}\x{a6}\x{c3}\x{b8}\x{c3}\x{a5} \x{c3}\x{86}\x{c3}\x{98}\x{c3}\x{85}",
                   "same, but this thime with an \x{201d} instead .. : \x{c3}\x{a6}\x{c3}\x{b8}\x{c3}\x{a5} \x{c3}\x{86}\x{c3}\x{98}\x{c3}\x{85}"
                 ]
        };
Looks like it's double encoded?
Any ideas?

Thanx
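
The Dumper output shows each affected string holding a real wide character (\x{2013}) next to pairs such as \x{c3}\x{a6}, which are the UTF-8 bytes of æ stored as separate characters. The following is a minimal sketch of one way strings of that shape can arise. It assumes (this is a guess, not something confirmed for XML::Simple or its parser) that the Latin-1 text comes back as undecoded UTF-8 bytes and is then concatenated with an already-decoded character. The helper `code_points` is hypothetical and exists only for illustration.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: list the code points of a string, one per character.
sub code_points {
    my ($string) = @_;
    return join ' ', map { sprintf('U+%04X', ord($_)) } split(//, $string);
}

# Assumption: "æøå" arrives as six undecoded UTF-8 bytes, not three characters.
my $bytes = encode('UTF-8', "\x{e6}\x{f8}\x{e5}");

# EN DASH, the character that &#8211; refers to.
my $dash = "\x{2013}";

# Joining a wide character with a byte string makes Perl treat every byte
# as a Latin-1 character, so the UTF-8 bytes survive as separate characters.
my $mixed = "an $dash .. $bytes";

print code_points($mixed), "\n";
```

The last six code points printed are U+00C3 U+00A6 U+00C3 U+00B8 U+00C3 U+00A5, the same pattern as in the dump above.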

Re: Entities confuse encoding in XML::Simple
by moritz (Cardinal) on Jan 03, 2008 at 11:30 UTC
    Maybe you have to set up your environment a bit better:
    binmode STDOUT, ':encoding(UTF-8)'; # or whatever your terminal uses
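
A minimal sketch of that suggestion in context, assuming a UTF-8 terminal. The sample string is made up for illustration; it contains the EN DASH that &#8211; refers to. With the layer in place, printing a string that contains a character above U+00FF should no longer trigger the "Wide character in print" warning.

```perl
use strict;
use warnings;

# Encode character strings as UTF-8 on the way out.
# Assumption: the terminal expects UTF-8; use its actual encoding otherwise.
binmode STDOUT, ':encoding(UTF-8)';

# Illustrative string with an EN DASH (U+2013).
my $text = "foo with an \x{2013}";

print "$text\n";
```

This only helps for strings that hold decoded characters. A string that still holds raw UTF-8 bytes will be encoded a second time by the layer.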
      Even if it somehow had the wrong encoding, isn't it strange that some of the output is correct and some is not?

      Anyway, the terminal is utf8 and I was surprised to see that binmode STDOUT, ':encoding(UTF-8)' actually made it worse:
      foo
      foo with an –
      some latin1 encoded chars: æøå Ã&#134;Ã&#152;Ã&#133;
      same, but this time whith an – .. æøå Ã&#134;Ã&#152;Ã&#133;
      same, but this thime with an ” instead .. : æøå Ã&#134;Ã&#152;Ã&#133;
      $VAR1 = {
                'p' => [
                         'foo',
                         "foo with an \x{2013}",
                         'some latin1 encoded chars: æøå Ã&#134;Ã&#152;Ã&#133;',
                         "same, but this time whith an \x{2013} .. \x{c3}\x{a6}\x{c3}\x{b8}\x{c3}\x{a5} \x{c3}\x{86}\x{c3}\x{98}\x{c3}\x{85}",
                         "same, but this thime with an \x{201d} instead .. : \x{c3}\x{a6}\x{c3}\x{b8}\x{c3}\x{a5} \x{c3}\x{86}\x{c3}\x{98}\x{c3}\x{85}"
                       ]
              };
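
One possible reading of why the third line got worse: if that string holds undecoded UTF-8 bytes, an :encoding(UTF-8) layer treats each byte as a Latin-1 character and encodes it again. The sketch below illustrates that effect under this assumption, using an in-memory filehandle so the resulting bytes can be inspected. The helper `utf8_layer_hex` is hypothetical, and nothing here is confirmed as what XML::Simple actually returns.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Hypothetical helper: print $string through an :encoding(UTF-8) layer into
# an in-memory buffer and return the resulting bytes as hex.
sub utf8_layer_hex {
    my ($string) = @_;
    open my $fh, '>:encoding(UTF-8)', \my $buffer
        or die "Cannot open in-memory handle: $!";
    print {$fh} $string;
    close $fh;
    return join ' ', map { sprintf('%02X', ord($_)) } split(//, $buffer);
}

# Assumption: "æ" arrives as two undecoded UTF-8 bytes, C3 A6.
my $bytes = encode('UTF-8', "\x{e6}");

# Printed as-is, each byte is encoded again.
my $double = utf8_layer_hex($bytes);

# Decoded to a single character first, it is encoded only once.
my $single = utf8_layer_hex(decode('UTF-8', $bytes));

print "undecoded: $double\n";   # C3 83 C2 A6
print "decoded:   $single\n";   # C3 A6
```

If that is what is happening, decoding the byte strings with Encode::decode before printing would make them consistent with the strings that already contain wide characters. The mixed strings in the dump would need separate handling.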
        Did you find a solution? I'm having the same problem as you.