in reply to XML File Encoding and Parsing Problem

You're saying:
When the first line is
<?xml version="1.0" encoding="ISO-8859-1" ?>
everything works well

That's kind of like when the guy tells his doctor, "It only hurts when I do this...", to which the doctor replies, "Well, don't do that. (That'll be $50 for the visit.)"

Why assert that the xml file is utf8 when it's actually iso-8859-1? Is there a reason why you would want the xml file to really be utf8? Or maybe what you want is, after reading an iso-8859-1 xml file, to output something as utf8 data?

If you really want utf8 data in your xml, you might need to tell us more about how you are writing the xml file. If you just want to read the xml file as-is and output utf8 data, that's easy. After reading/parsing the xml file correctly, perl has the text stored internally (in memory) as utf8 strings.

(update: I'm not actually sure whether a non-utf8 xml file would automatically be converted to utf8 strings upon being parsed; you might need to explicitly "decode" the text in order to convert it to utf8; in that case, since you already know what the original (non-unicode) character set is, converting to utf8 is still really simple -- refer to the Encode module. Then, to output the data as utf8, ...)
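In case the explicit "decode" step is needed, here is a minimal sketch of it using the Encode module. The input string is a made-up example (the bytes of "25°C" as they would sit in an iso-8859-1 file), not something taken from the original poster's data.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical input: the bytes of "25 degrees C" as stored in an ISO-8859-1 file.
my $latin1_bytes = "25\xB0C";

# decode() turns encoded bytes into a Perl character string.
my $text = decode('ISO-8859-1', $latin1_bytes);

# encode() turns the character string into UTF-8 bytes, ready for output.
my $utf8_bytes = encode('UTF-8', $text);

# prints "4 characters, 5 UTF-8 bytes"
printf "%d characters, %d UTF-8 bytes\n", length($text), length($utf8_bytes);
```

The degree sign is one character either way; it only takes two bytes once it is encoded as utf8.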

Just set whatever output file handle to utf8 mode in order to print the text as utf8 data:

binmode $output_file_handle, ":utf8";
(where the first arg to binmode could be STDOUT, or any similar file handle that you've opened for output).
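As a rough illustration of what that binmode call does, the sketch below prints one character to an in-memory file handle and then looks at the bytes that came out. The in-memory handle and the variable names are only there to make the result visible; a real script would use STDOUT or a file opened for output.

```perl
use strict;
use warnings;

# Write to an in-memory "file" so the resulting bytes can be inspected.
my $buffer = '';
open my $output_file_handle, '>', \$buffer or die "open failed: $!";

# Everything printed to this handle is now encoded as utf8 on the way out.
binmode $output_file_handle, ":utf8";

# U+00B0 DEGREE SIGN, one character in memory.
print $output_file_handle "\x{B0}";
close $output_file_handle;

# prints "c2b0": the two-byte utf8 form of the degree sign
print unpack('H*', $buffer), "\n";
```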

Re^2: XML File Encoding and Parsing Problem
by merrymonk (Hermit) on Mar 08, 2006 at 09:18 UTC
    Thanks. However, I should have explained that I am working with an XML file
    written by someone else.
    Therefore I do not have any control over the encoding that they use.

      Again, if they output the single octet b0 (hex) in a file declared as UTF-8, the file is malformed. b0 is the degree sign in ISO-8859-1 and ISO-8859-15, but a lone b0 is not a valid UTF-8 sequence. In UTF-8 the degree sign is the two octets c2 b0.
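      The difference can be checked with a few lines of Perl. This is only a sketch using the Encode module; the variable names are made up for the example.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# U+00B0 DEGREE SIGN as a one-character Perl string.
my $degree = "\x{B0}";

# Hex dump of the encoded bytes in each encoding.
my $latin1_hex = unpack('H*', encode('ISO-8859-1', $degree));
my $utf8_hex   = unpack('H*', encode('UTF-8',      $degree));
print "ISO-8859-1: $latin1_hex\n";   # prints "ISO-8859-1: b0"
print "UTF-8:      $utf8_hex\n";     # prints "UTF-8:      c2b0"

# A lone b0 octet is malformed UTF-8; with FB_CROAK, decode() dies on it.
my $lone = "\xB0";
my $ok = eval { decode('UTF-8', $lone, Encode::FB_CROAK()); 1 };
print "lone b0 decodes as UTF-8: ", ($ok ? "yes" : "no"), "\n";   # prints "no"
```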

      You can tell them that if they are too stupid to do it correctly, they can use a numeric character reference instead: &#xb0; works regardless of the file's encoding.
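      For whoever writes the file, one possible way to do that is to replace every non-ASCII character with its hex character reference before output. The helper name to_char_refs is hypothetical, and the sketch assumes the text is already a decoded Perl character string.

```perl
use strict;
use warnings;

# Hypothetical helper: replace each non-ASCII character with &#x...; notation.
sub to_char_refs {
    my ($text) = @_;
    $text =~ s/([^\x00-\x7F])/sprintf('&#x%x;', ord $1)/ge;
    return $text;
}

# prints "25&#xb0;C"
print to_char_refs("25\x{B0}C"), "\n";
```

      The result is pure ASCII, so it reads the same whether the file is declared as ISO-8859-1 or UTF-8.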