danjkool35 has asked for the wisdom of the Perl Monks concerning the following question:

Hi,

I'm trying to parse some XML files, some of which contain non-UTF-8 characters, using XML::Simple.

I get the following error message:

/Users/Dan/Documents/Corpora/HIV_Database/xml/1438278.xml:66: parser error : Input is not proper UTF-8, indicate encoding !

Does anyone know how to tweak the options so that I can either exclude these characters from the file or, better still, read them in?

Thanks

  • Comment on XML::Simple Non-UTF-8 characters won't read

Re: XML::Simple Non-UTF-8 characters won't read
by ikegami (Patriarch) on Feb 01, 2011 at 17:49 UTC

    The default encoding for XML is UTF-8. If your document doesn't use UTF-8, it needs to indicate the encoding it did use.

    You appear to have the first of the following (or maybe no <?xml?> at all). You need to change it to the last.

    Declaration                                     Encoding assumed by the parser
    <?xml version="1.0"?>                           UTF-8
    <?xml version="1.0" encoding="UTF-8"?>          UTF-8
    <?xml version="1.0" encoding="Windows-1252"?>   Windows-1252

    If you're not the one who is producing this bad XML, you can still easily fix it by applying a substitution before passing the XML to the XML parser.
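One way such a substitution might look is sketched below. This is only an illustration: `declare_encoding` is a made-up helper name, and Windows-1252 is an assumption about what the files really contain, so check a few of them before relying on it. The helper works on the raw XML text and either rewrites the encoding in an existing declaration, adds one to a declaration that lacks it, or prepends a declaration if there is none.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of XML::Simple): make the XML declaration
# of $xml name $encoding. $xml should be the raw, undecoded file contents.
sub declare_encoding {
    my ($xml, $encoding) = @_;

    # Case 1: the declaration already names an encoding; replace the value.
    return $xml
        if $xml =~ s/\A(<\?xml\b[^>]*?\bencoding\s*=\s*)(["'])[^"']*\2/$1$2$encoding$2/;

    # Case 2: there is a declaration but no encoding; add one after version.
    return $xml
        if $xml =~ s/\A(<\?xml\s+version\s*=\s*(["'])[^"']*\2)/$1 encoding="$encoding"/;

    # Case 3: no declaration at all; prepend a complete one.
    return qq{<?xml version="1.0" encoding="$encoding"?>\n} . $xml;
}

# Usage sketch (needs XML::Simple; $path is a placeholder):
#   open my $fh, '<:raw', $path or die "Cannot open $path: $!";
#   my $xml  = do { local $/; <$fh> };
#   my $data = XMLin(declare_encoding($xml, 'Windows-1252'));
```

Reading the file with the `:raw` layer keeps the bytes untouched, so the parser does the decoding itself according to the declaration it now sees.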

      Excellent, thanks. As it happens, I'm not the author of the XML files. I'm just installing the Encoding::FixLatin module from CPAN. Hopefully this will take care of the conversion.
Re: XML::Simple Non-UTF-8 characters won't read
by grantm (Parson) on Feb 02, 2011 at 00:04 UTC

    In addition to ikegami's excellent description of what is wrong with your XML input, another option to consider is to pass all input through Encoding::FixLatin before passing it to XML::Simple. This will ensure that the characters are UTF-8 encoded by the time they get to the parser. This assumes that your input is ASCII/Latin-1/CP1252/UTF-8 or some mixture thereof.
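A minimal sketch of that approach follows, assuming Encoding::FixLatin is installed from CPAN. The helper name `slurp_as_utf8` is invented for this example. It reads the file as raw bytes and uses the module's `fix_latin` function with the `bytes_only` option, which asks for a string of UTF-8 bytes rather than a Perl character string, since that is the form a parser expects for undecoded input.

```perl
use strict;
use warnings;
use Encoding::FixLatin qw(fix_latin);

# Hypothetical helper: read $path as raw bytes and return the contents
# as UTF-8 bytes, converting any Latin-1/CP1252 bytes found along the way.
sub slurp_as_utf8 {
    my ($path) = @_;

    open my $fh, '<:raw', $path or die "Cannot open $path: $!";
    my $bytes = do { local $/; <$fh> };   # slurp the whole file
    close $fh;

    # bytes_only => 1 returns UTF-8 bytes instead of a character string.
    return fix_latin($bytes, bytes_only => 1);
}

# Usage sketch (needs XML::Simple; $path is a placeholder):
#   my $data = XMLin(slurp_as_utf8($path));
```

This suits files with no encoding in the XML declaration, as in the error above. If a file declares some other encoding, the declaration would no longer match the converted bytes and would need adjusting as well.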

      Thanks for a great suggestion. This should solve my problem.