

The encoding is specified as UTF-8 in the <?xml?> tag of my XML doc.


You mentioned "since your document is not encoded using UTF-8", but I have specified the encoding as UTF-8 in the <?xml?> tag. How do I "actually" encode the doc in UTF-8?


Re^3: Parsing extended ascii characters using XML::LibXML...
by almut (Canon) on Oct 28, 2009 at 16:25 UTC
    I have specified the encoding as UTF-8 in the <?xml?> tag.

    Unless the file is in fact encoded as UTF-8, that makes about as much sense as saying "this is green" ... Specifying the desired encoding at the top of the file does not automagically convert to that encoding.

    To convert it you first have to know what the current encoding is.
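
As a rough sketch of what such a conversion might look like, assume purely for illustration that the file is currently ISO-8859-1 (the thread does not say what it really is). The helper name latin1_to_utf8 is hypothetical; only the core Encode module is used.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: re-encode a byte string from ISO-8859-1 to UTF-8.
# The source encoding is an assumption; substitute whatever the file really uses.
sub latin1_to_utf8 {
    my ($bytes) = @_;
    my $chars = decode('ISO-8859-1', $bytes);   # bytes -> Perl characters
    return encode('UTF-8', $chars);             # characters -> UTF-8 bytes
}

my $latin1 = "caf\xE9";                 # e-acute is the single byte E9 in ISO-8859-1
my $utf8   = latin1_to_utf8($latin1);   # e-acute becomes the two bytes C3 A9
printf "%d bytes in, %d bytes out\n", length($latin1), length($utf8);
```

If the guess about the source encoding is wrong, the result will be valid UTF-8 but the wrong characters, which is why the current encoding has to be known first.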

Re^3: Parsing extended ascii characters using XML::LibXML...
by ikegami (Patriarch) on Oct 28, 2009 at 16:57 UTC

    How do I "actually" encode the doc in UTF-8?

    That depends on how you generate the document. If you create it by hand, it could look something like

    open(my $fh, '>:encoding(UTF-8)', $qfn) or die;
    print($fh qq{<?xml version="1.0" encoding="UTF-8"?>\n});
    print($fh qq{<foo>\n});
    print($fh qq{ <bar>}, xml_text($s_7bit), qq{</bar>\n});
    print($fh qq{ <bar>}, xml_text($s_8bit), qq{</bar>\n});
    print($fh qq{ <bar>}, xml_text($s_32bit), qq{</bar>\n});
    print($fh qq{</foo>\n});

    The :encoding PerlIO layer encodes the characters as UTF-8 as they are written. Without the :encoding layer, the IO system assumes the characters are already encoded, i.e. that each one is a byte, and will complain (with a "Wide character in print" warning) if you pass it characters that aren't bytes.
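
To see the layer at work, here is a minimal sketch that writes to an in-memory buffer instead of a file so the resulting bytes can be checked. The post calls xml_text() without showing it, so the version below is only a guess at what such a helper might do (escaping &, < and >), not ikegami's actual code.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the xml_text() helper the post calls but does not show:
# escape the characters that are special in XML text nodes.
sub xml_text {
    my ($s) = @_;
    $s =~ s/&/&amp;/g;    # must come first so later entities are not re-escaped
    $s =~ s/</&lt;/g;
    $s =~ s/>/&gt;/g;
    return $s;
}

# Write to an in-memory buffer so the encoded bytes can be inspected afterwards.
my $buf = '';
open(my $fh, '>:encoding(UTF-8)', \$buf) or die "open failed: $!";
print($fh qq{<bar>}, xml_text("caf\x{E9} & <tea>"), qq{</bar>});
close($fh) or die "close failed: $!";

# The single character U+00E9 was written as the two UTF-8 bytes C3 A9.
print "e-acute was written as UTF-8 bytes\n" if $buf =~ /\xC3\xA9/;
```

The string passed to print holds characters, not bytes; the layer on the handle is the only place where encoding happens, which keeps the rest of the program free of manual encode() calls.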

    Update: Fixed bug in code (wasn't using the handle I opened). Added the required prefix to the original code.