XML File Encoding and Parsing Problem

merrymonk has asked for the wisdom of the Perl Monks concerning the following question:

In an XML file where the first line is
<?xml version="1.0" encoding="UTF-8"?>
I get the following error message when trying to open it with Internet Explorer:

The XML page cannot be displayed
Cannot view XML input using XSL style sheet.
Please correct the error and then click the Refresh button, or try again later.
An invalid character was found in text content. Error processing resource
'file name
... characters in line then (1=0

The file also fails to parse; that is, the eval below fails:

use XML::DOM;
my $parser = XML::DOM::Parser->new;
my $doc = eval { $parser->parsefile($filename) };   # undef if parsing died
warn "Parse failed: $@" unless $doc;                # show why it failed

When the first line is
<?xml version="1.0" encoding="ISO-8859-1" ?>
everything works well.
The character after the (1=0 is a ° (a degree sign).
I guess the reason is that, with UTF-8 encoding, a degree sign has to be written in a specific way,
or perhaps the parsing has to be done in a different way.
Can a wiser Monk let me know if either is true and what should be done to overcome the problem?


Replies are listed 'Best First'.
Re: XML File Encoding and Parsing Problem by dorward (Curate) on Mar 07, 2006 at 23:09 UTC
I believe the degree sign holds different locations in ISO-8859-1 and UTF-8. If you output ISO-8859-1 and call it UTF-8 then you are going to have problems (unless every character you use happens to hold the same location). The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) is probably worth a read. So is perldoc perluniintro.
Re^2: XML File Encoding and Parsing Problem by graff (Chancellor) on Mar 08, 2006 at 00:11 UTC
If I understand what I think you're trying to say here: "you are going to have problems (unless every character you use happens to hold the same location)", that is actually a little misleading. UTF-8 encoding is designed such that, in order for "every character you use ... to hold the same location", the file must consist entirely of byte values below 0x80 -- that is, it must be a pure ASCII file. (And if that is the case, then technically, the file is UTF-8, because ASCII data is a proper subset of UTF-8 data.) If text is encoded in any character set other than utf8 (e.g. any ISO-8859, or CP12, or whatever) and includes anything outside the ASCII 7-bit table, then there is no way whatsoever, if you try to treat that data as utf8, for any of those wide characters to come out as "the same character". In other words, there is no wide character defined in utf8 such that the sequence of bytes representing that utf8 character is identical to the bytes representing the same (linguistic) character in any non-unicode encoding.
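A minimal sketch of the point above, using the core Encode module. The variable names and the sample strings are only illustrative. It shows that the degree sign is stored as different bytes in the two encodings, while ASCII-only text is stored identically.

```perl
use strict;
use warnings;
use Encode qw(encode);

# The degree sign is the same character (code point U+00B0) either way;
# only the bytes used to store it differ between the two encodings.
my $degree = "\x{B0}";

my $latin1 = encode('ISO-8859-1', $degree);   # a single byte
my $utf8   = encode('UTF-8',      $degree);   # a two-byte sequence

printf "ISO-8859-1: %s\n", unpack('H*', $latin1);
printf "UTF-8:      %s\n", unpack('H*', $utf8);

# Text made only of ASCII characters is byte-for-byte identical
# in both encodings.
my $ascii = "(1=0";
my $same  = encode('ISO-8859-1', $ascii) eq encode('UTF-8', $ascii);
print "ASCII text identical in both: ", ($same ? "yes" : "no"), "\n";
```

Run as written, this should print `b0` for ISO-8859-1 and `c2b0` for UTF-8, which is why a lone `b0` byte in a file declared as UTF-8 is rejected.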
Re^2: XML File Encoding and Parsing Problem by merrymonk (Hermit) on Mar 08, 2006 at 09:21 UTC
Thanks for the information and link. I get the strong impression that the root of the problem is that the degree sign has been written as such. I believe its 'number' is greater than 0x80, which is causing things to fail.
Re: XML File Encoding and Parsing Problem by graff (Chancellor) on Mar 08, 2006 at 00:35 UTC
You're saying:

When the first line is `<?xml version="1.0" encoding="ISO-8859-1" ?>` everything works well

That's kind of like when the guy tells his doctor, "It only hurts when I do this...", to which the doctor replies, "Well, don't do that. (That'll be $50 for the visit.)"

Why assert that the xml file is utf8 when it's actually iso-8859-1? Is there a reason why you would want the xml file to really be utf8? Or maybe what you want is, after reading an iso-8859-1 xml file, to output something as utf8 data?

If you really want utf8 data in your xml, you might need to tell us more about how you are writing the xml file. If you just want to read the xml file as-is and output utf8 data, that's easy. ~~After reading/parsing the xml file correctly, perl has the text stored internally (in memory) as utf8 strings.~~

(update: I'm not actually sure whether a non-utf8 xml file would automatically be converted to utf8 strings upon being parsed; you might need to explicitly "decode" the text in order to convert it to utf8; in that case, since you already know what the original (non-unicode) character set is, converting to utf8 is still really simple -- refer to the Encode module. Then, to output the data as utf8, ...)

Just set whatever output file handle to utf8 mode in order to print the text as utf8 data: `binmode $output_file_handle, ":utf8";` (where the first arg to binmode could be STDOUT, or any similar file handle that you've opened for output).
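A rough sketch of the decode-then-output approach, assuming the input bytes really are ISO-8859-1. The sample string and the in-memory output handle are stand-ins invented for illustration; in practice the text would come from the parsed file and the handle would be STDOUT or a real file. The stricter `:encoding(UTF-8)` layer is used here in place of `:utf8`; that is a substitution, not what the post above prescribes.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical sample: bytes as they would sit in an ISO-8859-1 file.
my $latin1_bytes = "angle=90\xB0";

# Turn the raw bytes into a Perl character string.
my $text = decode('ISO-8859-1', $latin1_bytes);

# Print the characters through a handle that encodes to UTF-8.
# An in-memory handle is used so the resulting bytes can be inspected.
my $out_buffer = '';
open my $out, '>', \$out_buffer
    or die "Cannot open in-memory handle: $!";
binmode $out, ':encoding(UTF-8)';
print {$out} $text;
close $out;

# Show the bytes that were written, in hex.
printf "%s\n", unpack('H*', $out_buffer);
```

The hex dump should end in `c2b0`, the UTF-8 form of the degree sign, in place of the single `b0` byte of the input.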
Re^2: XML File Encoding and Parsing Problem by merrymonk (Hermit) on Mar 08, 2006 at 09:18 UTC
Thanks. However, I should have explained that I am working with an XML file written by someone else, so I do not have any control over the encoding that they want to use.
Re^3: XML File Encoding and Parsing Problem by Anonymous Monk on Mar 08, 2006 at 12:27 UTC
Again, if they are outputting the octet `b0` (in hex notation) in a file declared as UTF-8, that is illegal. `b0` is the degree sign in ISO-8859-1 and ISO-8859-15, but not in UTF-8, where the degree sign is `c2b0`. You can tell them that if they are too stupid to do it correctly, they can use entities instead: a numeric character reference such as `&#xB0;` works regardless of encoding.
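The byte-level claim can be checked with the core Encode module, and the entity workaround amounts to writing a numeric character reference such as `&#xB0;`. The following is only a sketch: `is_valid_utf8` and `to_char_refs` are hypothetical helper names made up for this example, not part of any module.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: true if the byte string is well-formed UTF-8.
# decode() with FB_CROAK dies on malformed input, so eval traps that.
sub is_valid_utf8 {
    my ($bytes) = @_;    # work on a copy; decode may modify its argument
    return eval { decode('UTF-8', $bytes, Encode::FB_CROAK); 1 } ? 1 : 0;
}

# Hypothetical helper: replace each non-ASCII character in a decoded
# string with an XML numeric character reference.
sub to_char_refs {
    my ($text) = @_;
    $text =~ s/([^\x00-\x7F])/sprintf('&#x%X;', ord $1)/ge;
    return $text;
}

my $latin1_degree = "\xB0";       # degree sign as ISO-8859-1 bytes
my $utf8_degree   = "\xC2\xB0";   # degree sign as UTF-8 bytes

print "b0 alone valid UTF-8?  ", is_valid_utf8($latin1_degree), "\n";
print "c2 b0 valid UTF-8?     ", is_valid_utf8($utf8_degree),   "\n";

# Decode with the encoding the file really uses, then escape.
my $escaped = to_char_refs(decode('ISO-8859-1', "(1=0\xB0"));
print "$escaped\n";
```

Because the escaped result contains only ASCII bytes, it reads the same whether the XML declaration says UTF-8 or ISO-8859-1.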
Re: XML File Encoding and Parsing Problem by Aristotle (Chancellor) on Mar 11, 2006 at 07:34 UTC
Do yourself a favour and read The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) as well as Characters vs. Bytes. Makeshifts last the longest.