Unicode XML Parsing Problem

SheridanCat has asked for the wisdom of the Perl Monks concerning the following question:

I am sure this has been asked before, but a Google search and a Super Search here aren't turning up much that is useful.

I have a data provider who is sending me XML with the following at the top:

<?xml version="1.0" encoding="UNICODE"?>

And, sure enough, if I look at this in certain editors such as Eclipse, there are Kanji characters in there.

When I try parsing this with XML::LibXML, I get the following error:

text_file.xml:1: parser error : Unsupported encoding UNICODE
<?xml version="1.0" encoding="UNICODE"?>

I get a similar error from XML::Simple, which is no surprise, I suppose.

I understand that expat has built-in support for the UTF-8, ISO-8859-1, UTF-16, and US-ASCII encodings. So, can anyone shed some light on how I can parse this Unicode XML?

If it matters, I'm running ActiveState 5.8.3 on WinXP. Any bit of assistance is appreciated.

Regards, SheridanCat


Re: Unicode XML Parsing Problem by ikegami (Patriarch) on Sep 23, 2005 at 18:26 UTC
UNICODE is not an encoding, at least not on its own. It assigns numbers to characters, but it doesn't dictate how those numbers are represented as bytes. Encodings such as UTF-8 and UTF-16 provide this missing information. You'll need to fix the XML declaration so that it names the encoding the file actually uses. If the file actually uses the UTF-8 encoding, you could use something like the following to fix it: `$xml =~ s/encoding="(?i:UNICODE)"/encoding="UTF-8"/;`
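To make the distinction between code points and encodings concrete, here is a minimal sketch that encodes one character three ways using the core Encode module. The sample character (U+65E5) and the helper name `hex_bytes` are chosen only for illustration and are not from the thread.

```perl
use strict;
use warnings;
use Encode qw(encode);

# hex_bytes is a hypothetical helper: it encodes a character string
# with the named encoding and returns the bytes as spaced hex pairs.
sub hex_bytes {
    my ($encoding, $chars) = @_;
    my $bytes = encode($encoding, $chars);
    return join ' ', map { sprintf '%02X', $_ } unpack 'C*', $bytes;
}

# U+65E5, a kanji picked purely as an example code point.
my $char = "\x{65E5}";

for my $encoding (qw(UTF-8 UTF-16LE UTF-16BE)) {
    printf "%-8s %s\n", $encoding, hex_bytes($encoding, $char);
}
# UTF-8    E6 97 A5
# UTF-16LE E5 65
# UTF-16BE 65 E5
```

The same code point yields three different byte sequences, which is why a declaration that says only "UNICODE" gives a parser nothing to work with.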
Re: Unicode XML Parsing Problem by bart (Canon) on Sep 23, 2005 at 18:38 UTC
"Unicode" is not an encoding. You should try to find out what they mean by it. My guess is that it is a fixed-width encoding with 2 bytes per character, either in little-endian order (as used on Windows) or in big-endian order. You can read more about Unicode encodings on czyborra.com. In particular, check out the sections on UTF-16 and UCS-2. Perhaps you can make it work without any external help: drop or trim the (IMO) useless declaration above, optionally add a proper BOM, and it might just work. Or perhaps not. If it still doesn't work, it is possible to generate additional character set tables with the tools that come with XML::Encoding. For example, make_encmap is a script that has been used to produce the tables included with this module. At the very least, it should show you how those tables are constructed.
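One way to find out what the provider means is to look at the first few raw bytes of the file. The sketch below checks for the standard byte order marks and, failing that, for the byte patterns that `<?` produces in UTF-16. The function names `encoding_from_bom` and `sniff_file` are hypothetical, and the check is only a guess at the encoding, not a guarantee.

```perl
use strict;
use warnings;

# Guess an encoding name from the leading bytes of a document.
# Returns nothing (undef in scalar context) when no pattern matches.
sub encoding_from_bom {
    my ($head) = @_;
    # UTF-32 marks are tested first: the UTF-32LE mark starts with
    # the same two bytes as the UTF-16LE mark.
    return 'UTF-32LE' if $head =~ /\A\xFF\xFE\x00\x00/;
    return 'UTF-32BE' if $head =~ /\A\x00\x00\xFE\xFF/;
    return 'UTF-8'    if $head =~ /\A\xEF\xBB\xBF/;
    return 'UTF-16LE' if $head =~ /\A\xFF\xFE/;
    return 'UTF-16BE' if $head =~ /\A\xFE\xFF/;
    # No BOM: an XML declaration starts with "<?", which shows up as
    # 3C 00 3F 00 in UTF-16LE and 00 3C 00 3F in UTF-16BE.
    return 'UTF-16LE' if $head =~ /\A<\x00\?\x00/;
    return 'UTF-16BE' if $head =~ /\A\x00<\x00\?/;
    return;
}

# Read the first four bytes of a file without any decoding layer.
sub sniff_file {
    my ($path) = @_;
    open my $fh, '<:raw', $path or die "Cannot open $path: $!";
    my $head = '';
    read $fh, $head, 4;
    close $fh;
    return encoding_from_bom($head);
}

my $guess = encoding_from_bom("\xFF\xFE<\x00?\x00");
print defined $guess ? $guess : 'unknown', "\n";
```

Calling `sniff_file('text_file.xml')` on the provider's file would show whether a BOM is present and which byte order it implies.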
Re: Unicode XML Parsing Problem by Errto (Vicar) on Sep 23, 2005 at 18:52 UTC
Microsoft, as mentioned above, has a bad habit of using the word "Unicode" when they really mean UTF-16 little endian, and has thus tricked many developers into creating XML documents that look like the one you were sent, with the invalid character encoding name. With XML::LibXML, this means you can't use `parse_file`, because you have to remove the faulty declaration first. Here's a snippet that does it (assuming $xmlfile contains the filename): `use XML::LibXML; my $parser = XML::LibXML->new(); open my $in, '<:encoding(UTF-16)', $xmlfile or die $!; my $xmltext = do { local $/; <$in> }; close $in; $xmltext =~ s/encoding="Unicode"//i; my $doc = $parser->parse_string($xmltext) or die "Could not process XML file $xmlfile";` My own Windows app (the one mentioned on my home node) does precisely this.
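For anyone who wants to try this approach without the provider's file, here is a self-contained sketch under stated assumptions: it writes a small UTF-16LE sample with a BOM and the bogus declaration to a temporary file, reads it back decoded, and removes the encoding pseudo-attribute. The sample content and the name `slurp_without_encoding_attr` are invented for illustration. The `:raw` in the open mode is my addition, meant to keep CRLF translation away from the undecoded bytes on Windows. The parse step runs only if XML::LibXML is installed.

```perl
use strict;
use warnings;
use Encode qw(encode);
use File::Temp qw(tempfile);

# Build a sample shaped like the file described in the question:
# UTF-16LE bytes, a leading BOM, and encoding="UNICODE".
my $sample = qq{<?xml version="1.0" encoding="UNICODE"?>\n}
           . qq{<doc>\x{65E5}\x{672C}</doc>\n};
my ($out, $xmlfile) = tempfile(SUFFIX => '.xml', UNLINK => 1);
binmode $out;
print {$out} "\xFF\xFE", encode('UTF-16LE', $sample);
close $out or die "Cannot close $xmlfile: $!";

# Read the whole file as characters and drop the encoding
# pseudo-attribute, whatever name it carries. The text is already
# decoded at that point, so the attribute has nothing left to say.
sub slurp_without_encoding_attr {
    my ($path) = @_;
    open my $in, '<:raw:encoding(UTF-16)', $path
        or die "Cannot open $path: $!";
    my $text = do { local $/; <$in> };
    close $in;
    $text =~ s/\A(<\?xml[^>]*?)\s+encoding=(["'])[^"']*\2/$1/;
    return $text;
}

my $xmltext = slurp_without_encoding_attr($xmlfile);

# Optional: parse the cleaned text when XML::LibXML is available.
if (eval { require XML::LibXML; 1 }) {
    my $doc = XML::LibXML->new->parse_string($xmltext);
    binmode STDOUT, ':encoding(UTF-8)';
    print $doc->documentElement->textContent, "\n";
}
```

Unlike the case-insensitive match on "Unicode" alone, the substitution here removes any declared encoding and its leading whitespace, which is a design choice of this sketch and not something the thread prescribes.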
Re^2: Unicode XML Parsing Problem by SheridanCat (Pilgrim) on Sep 23, 2005 at 20:35 UTC
Thanks to everyone who responded. This definitely points me in the right direction.	[reply]