nglenn has asked for the wisdom of the Perl Monks concerning the following question:

I am trying to parse an XML file with XML::Simple. The file has some attributes that include em dashes. Here is the XML:

<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<table id="{4fa6cd7a-f7b6-416d-8f59-3acc0eab9bdb}">
  <level type="b">
    <map abrv="JS—M"/>
    <map abrv="JS—K"/>
  </level>
</ettx>

Here is the code to show the error:

#!/usr/bin/perl -l
use strict;
use warnings;
use XML::Simple;
use Data::Dumper;

my $xml = XML::Simple->new();
my $fileName = 'C:\Users\nate\Desktop\testXML.txt';
my $tree = { table => {} };
$tree->{table} = $xml->XMLin($fileName,
    ForceArray => ['map'],
    KeyAttr    => [],
)->{table};
print Dumper($tree->{table});

Just change the file path to test it. The error produced when I run it is: Cannot decode string with wide characters at C:/Perl/lib/Encode.pm line 174.
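For readers unfamiliar with this message: Encode raises it when asked to decode a string that already contains characters above 0xFF, which typically means the text was decoded once already. The following is a minimal sketch of that general cause using only the core Encode module. It is an illustration, not a claim about the exact code path inside whichever parser produced the error here.

```perl
use strict;
use warnings;
use Encode qw(decode);

# The UTF-8 encoding of an em dash (U+2014) is the three bytes E2 80 94.
my $bytes = "JS\xE2\x80\x94M";

# First decode: bytes to characters. This is the single, correct decode.
my $chars = decode('UTF-8', $bytes);
printf "bytes: %d, characters: %d\n", length $bytes, length $chars;

# Second decode of the already-decoded text. The string now holds a
# character above 0xFF, so Encode cannot treat it as a byte string.
my $error = eval { decode('UTF-8', $chars); 1 } ? '' : $@;
print "second decode failed: $error" if $error;
```

If the second call is what a buggy layer does internally, plain ASCII input passes through unnoticed and only non-ASCII characters such as the em dash trigger the failure.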

What really annoys me is that there is a much bigger file that has the emdash in some attributes and it parses fine until I remove a certain number of lines, after which it produces the above error. Any suggestions?

Replies are listed 'Best First'.
Re: emdash problems with XML::Simple
by ikegami (Patriarch) on Aug 26, 2010 at 19:25 UTC
    What parser is your XML::Simple using? Try using XML::Parser instead.
    local $XML::Simple::PREFERRED_PARSER = 'XML::Parser';

      Wow! That got rid of my longtime error. I never would have thought of that in a million years. I thought XML::Simple always used XML::Parser.

      It still does something I don't expect, though: 'table' => undef is in the Dumper printout. Why isn't it reading it properly?

        "XML::Simple will default to using a SAX parser if one is available or XML::Parser if SAX is not available."

        You probably ended up using the PurePerl SAX parser which is/was buggy when it came to encodings. XML::Parser is by far the fastest existing backend for XML::Simple, according to my benchmarks a year ago.
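To find out which backends are candidates on a given machine, something like the sketch below can help. It assumes XML::SAX and XML::Parser may or may not be installed, so each is loaded inside eval. It only lists what is registered and does not show which one XML::Simple actually picked.

```perl
use strict;
use warnings;

# Collect one status line per backend family.
my @report;

if (eval { require XML::SAX; 1 }) {
    # XML::SAX->parsers returns an array ref of hashes, each with a Name key.
    my @names = map { $_->{Name} } @{ XML::SAX->parsers };
    push @report, 'SAX parsers: '
        . (@names ? join(', ', @names) : '(none registered)');
}
else {
    push @report, 'XML::SAX is not installed';
}

my $have_xml_parser = eval { require XML::Parser; 1 };
push @report, $have_xml_parser
    ? 'XML::Parser is installed'
    : 'XML::Parser is not installed';

print "$_\n" for @report;
```

If XML::SAX::PurePerl is the only SAX parser listed, that would be consistent with the diagnosis above.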

        XML::Simple removes the root by default.

        my $tree = { table => {} };
        $tree->{table} = $xml->XMLin($fileName,
            ForceArray => ['map'],
            KeyAttr    => [],
        )->{table};
        should be
        my $tree = $xml->XMLin($fileName,
            ForceArray => [qw( map )],
            KeyAttr    => {},
            KeepRoot   => 1,
        );
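The effect of the root being dropped can be seen with a small self-contained sketch. It assumes XML::Simple and XML::Parser are installed, and it uses a simplified inline document (ASCII attribute values, a made-up id of "t1") instead of the original file, so it demonstrates only the root handling and not the encoding issue.

```perl
use strict;
use warnings;
use XML::Simple qw(XMLin);

# Force the XML::Parser backend, as suggested earlier in the thread.
local $XML::Simple::PREFERRED_PARSER = 'XML::Parser';

# Simplified stand-in for the poster's file (hypothetical values).
my $xml = <<'XML';
<table id="t1">
  <level type="b">
    <map abrv="A"/>
    <map abrv="B"/>
  </level>
</table>
XML

my %opts = (ForceArray => ['map'], KeyAttr => []);

# Default: the root element is discarded, so there is no 'table' key
# and asking for ->{table} yields undef.
my $without_root = XMLin($xml, %opts);
print "without KeepRoot, keys: ", join(', ', sort keys %$without_root), "\n";

# KeepRoot => 1 keeps <table> as the single top-level key.
my $with_root = XMLin($xml, %opts, KeepRoot => 1);
print "with KeepRoot, keys: ", join(', ', sort keys %$with_root), "\n";
print "first abrv: $with_root->{table}{level}{map}[0]{abrv}\n";
```

This is why the original code, which appended ->{table} to a result that had already lost its root, ended up storing undef.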
Re: emdash problems with XML::Simple
by psini (Deacon) on Aug 26, 2010 at 18:53 UTC

    I tried your files with the following results:

    Saving the files "as is", I got an error "not well-formed (invalid token) at line 4, column 20, byte 153 at /usr/share/perl5/XML/Simple.pm line 362"

    Saving the XML file as a UTF-8 file (with BOM), I get an error because the last line should be "</table>", not "</ettx>"

    Correcting this I get no error, and the output "$VAR1 = undef;"

    Rule One: "Do not act incautiously when confronting a little bald wrinkly smiling man."

      I messed up the xml in the post. This still gives me the same error:

      <?xml version="1.0" encoding="utf-8" standalone="yes"?>
      <level type="b">
        <map abrv="JS—M"/>
        <map abrv="JS—K"/>
      </level>

      I saved it in a file as UTF-8 with signature.