biswanath_c has asked for the wisdom of the Perl Monks concerning the following question:


I am having issues while trying to parse an XML document that has extended ASCII characters, even though the XML document has specified the encoding as UTF-8!

I have this character, "Â", inside a node in an XML document, and when I try to parse the doc using the XML::LibXML module, I get this error:
parser error : Input is not proper UTF-8, indicate encoding!
Bytes: 0xC2 0x3C 0x6E 0x6C


When I try to get the ASCII value of that character using the ord() function, I get 194. How do I handle this character in XML using XML::LibXML?


Re: Parsing extended ascii characters using XML::LibXML...
by ikegami (Patriarch) on Oct 27, 2009 at 21:46 UTC

    even though the XML document has specified the encoding as UTF-8!

    That's a problem, since your document is not encoded using UTF-8.

    "Â" is encoded as bytes "C3 82" in UTF-8.

    "Â" is encoded as byte "C2" in a variety of encodings, including

    • iso-8859-1
    • iso-8859-2
    • iso-8859-3
    • iso-8859-4
    • iso-8859-9
    • iso-8859-10
    • iso-8859-14
    • iso-8859-15
    • Windows-1250
    • Windows-1252

    So which is your document's actual encoding?

    You'll need to specify it in the <?xml?> tag.

    <?xml version="1.0" encoding="..."?> ...
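
    The byte values above can be checked with the core Encode module. This is only a sketch: the variable names are made up for illustration, and the test bytes are the four bytes quoted in the parser error. It also shows why the parser complains: in UTF-8, 0xC2 starts a two-byte sequence and must be followed by a continuation byte, but 0x3C ("<") is not one.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# The character U+00C2 (LATIN CAPITAL LETTER A WITH CIRCUMFLEX).
my $char = "\x{C2}";

# Hex dump of the encoded bytes under two different encodings.
my $utf8_hex   = join ' ', map { sprintf '%02X', ord($_) } split //, encode('UTF-8', $char);
my $latin1_hex = join ' ', map { sprintf '%02X', ord($_) } split //, encode('iso-8859-1', $char);

print "UTF-8:      $utf8_hex\n";      # C3 82
print "iso-8859-1: $latin1_hex\n";    # C2

# The bytes from the parser error: 0xC2 followed by "<nl".
my $bytes = "\xC2<nl";

# Strict decoding dies on malformed input, so eval tells us whether it is valid.
# Decode a copy, because decode() may modify its input when a CHECK value is given.
my $copy = $bytes;
my $is_valid_utf8 = eval { decode('UTF-8', $copy, Encode::FB_CROAK()); 1 } ? 1 : 0;

print "valid UTF-8? $is_valid_utf8\n";    # 0
```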

      The encoding is specified as UTF-8 in the <?xml?> tag of my XML doc.


      You mentioned "since your document is not encoded using UTF-8". I have specified the encoding as UTF-8 in the <?xml?> tag. How do I "actually" encode the doc in UTF-8?


        I have specified the encoding as UTF-8 in the <?xml?> tag.

        Unless the file is in fact encoded as UTF-8, that makes about as much sense as saying "this is green" ... Specifying the desired encoding at the top of the file does not automagically convert to that encoding.

        To convert it you first have to know what the current encoding is.
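
        Once the current encoding is known, the conversion is a decode followed by an encode. A minimal sketch, assuming (purely for illustration) that the data is really iso-8859-1; the sample string and variable names are hypothetical:

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical input: "<a>Â</a>" stored as iso-8859-1, where Â is the single byte 0xC2.
my $latin1_bytes = "<a>\xC2</a>";

# Step 1: decode from the encoding the data is actually in (an assumption here).
my $chars = decode('iso-8859-1', $latin1_bytes);

# Step 2: encode to the encoding declared in the <?xml?> tag.
my $utf8_bytes = encode('UTF-8', $chars);

# Â is now the two bytes C3 82.
print unpack('H*', $utf8_bytes), "\n";    # 3c613ec3823c2f613e
```

        If the guess about the source encoding is wrong, the result will be valid UTF-8 but may contain the wrong characters, so it is worth checking the output by eye.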

        How do I "actually" encode the doc in UTF-8?

        That depends on how you generate the document. If you create it by hand, it could look something like

        open(my $fh, '>:encoding(UTF-8)', $qfn) or die;
        print($fh qq{<?xml version="1.0" encoding="UTF-8"?>\n});
        print($fh qq{<foo>\n});
        print($fh qq{ <bar>}, xml_text($s_7bit), qq{</bar>\n});
        print($fh qq{ <bar>}, xml_text($s_8bit), qq{</bar>\n});
        print($fh qq{ <bar>}, xml_text($s_32bit), qq{</bar>\n});
        print($fh qq{</foo>\n});
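
        The snippet above calls xml_text, which is not defined anywhere in this thread. One plausible reading is that it escapes the characters that are special in XML text content; the version below is a hypothetical sketch under that assumption, not the author's actual helper.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): escape text for use as XML element content.
sub xml_text {
    my ($text) = @_;
    $text =~ s/&/&amp;/g;    # must run first so the entities added below are not re-escaped
    $text =~ s/</&lt;/g;
    $text =~ s/>/&gt;/g;
    return $text;
}

print xml_text('a < b && c > d'), "\n";    # a &lt; b &amp;&amp; c &gt; d
```

        Note that this works on characters, not bytes; the encoding to UTF-8 is left to the :encoding layer on the output handle.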

        The :encoding PerlIO layer will encode the characters as UTF-8. Without the :encoding layer, the IO system will assume the characters are already encoded (and will freak out with a "Wide character" warning if you pass it characters that aren't bytes).
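
        The effect of the layer can be seen by printing the same character to two in-memory filehandles, one with the layer and one without. This is an illustrative sketch; the variable names are made up, and in-memory handles are used only so the resulting bytes can be inspected.

```perl
use strict;
use warnings;

my $text = "\x{C2}";    # one character: Â (U+00C2)

# With the layer: the character is encoded as UTF-8 on output.
my $with_layer = '';
open(my $out, '>:encoding(UTF-8)', \$with_layer) or die "open failed: $!";
print {$out} $text;
close($out) or die "close failed: $!";

# Without the layer: the character is written as the single raw byte 0xC2.
my $without_layer = '';
open(my $raw, '>', \$without_layer) or die "open failed: $!";
print {$raw} $text;
close($raw) or die "close failed: $!";

print unpack('H*', $with_layer), "\n";       # c382
print unpack('H*', $without_layer), "\n";    # c2
```

        The second result, a lone C2 byte in a file that claims to be UTF-8, is the kind of input that produces the parser error quoted in the question.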

        Update: Fixed bug in code (wasn't using the handle I opened). Added the required prefix to the original code.