http://qs1969.pair.com?node_id=85338


in reply to Converting character encodings

I read that XML is always in Unicode; specifically, that the encoding is always UTF-8 or UTF-16. Has this changed since that book was printed, or do people just do it anyway since the attribute is there?

In any case, the problem of converting from UTF-8 (internal to the script) to whatever encoding the caller wants is rather general.

Re: What are you expecting XML to be in?
by merlyn (Sage) on Jun 03, 2001 at 21:56 UTC
      That's a proper subset of UTF-8, so not really necessary. Can a particular XML file be represented in, say, 8859-6 or JIS-X, and still be standard? I don't like this because it means that a file can't be read unless the parser knows that character set.
        7-bit ISO-8859-1 (also called "ASCII" {grin}) is a proper subset of UTF-8, but not 8-bit ISO-8859-1. So yes, you'd need to declare the file as ISO-8859-1 if you wanted to have any "second half" characters, but otherwise you can let it default to UTF-8.

        -- Randal L. Schwartz, Perl hacker
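The subset point above can be checked byte by byte. The following is only an illustrative sketch in Python (not code from the thread; the variable names are made up for the example): 7-bit text encodes to the same bytes under both encodings, while a "second half" character such as U+00E9 does not, and its ISO-8859-1 byte is rejected by a UTF-8 decoder.

```python
# Compare how the same text is laid out in bytes under each encoding.
ascii_text = "plain"
accented = "caf\u00e9"  # U+00E9 lives in the "second half" of ISO-8859-1

# 7-bit characters: identical bytes in UTF-8 and ISO-8859-1.
same_for_ascii = ascii_text.encode("utf-8") == ascii_text.encode("iso-8859-1")

# 8-bit characters: one byte in ISO-8859-1, two bytes in UTF-8.
latin1_bytes = accented.encode("iso-8859-1")
utf8_bytes = accented.encode("utf-8")

# A lone 0xE9 byte is not valid UTF-8, so a UTF-8 decoder rejects it.
try:
    latin1_bytes.decode("utf-8")
    latin1_is_valid_utf8 = True
except UnicodeDecodeError:
    latin1_is_valid_utf8 = False

print(same_for_ascii, latin1_bytes, utf8_bytes, latin1_is_valid_utf8)
```

This is why a file containing such characters needs an explicit encoding declaration, while a pure 7-bit file can be left to the UTF-8 default.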

Re: What are you expecting XML to be in?
by mirod (Canon) on Jun 03, 2001 at 22:29 UTC

    Actually XML uses UTF-8 or UTF-16 by default (and has ways to figure out which one is used), but allows any encoding, as long as it is specified in the XML declaration (as <?xml version="1.0" encoding="whatever"?>). The parser then has to deal with the encoding.

    It is an implementation choice in expat (and then in XML::Parser) that all strings are passed to the handlers in UTF-8, but I don't think the XML spec mandates this choice.
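As a rough illustration of the two points above, here is a sketch using Python's binding to expat (`xml.parsers.expat`) rather than Perl's XML::Parser. Note one difference that is an assumption of this example and not something from the thread: Python hands the handlers decoded Unicode strings instead of UTF-8 bytes. The effect is the same, though: whatever encoding the declaration names, the handler sees one uniform representation. The helper name `parse_text` is hypothetical.

```python
import xml.parsers.expat


def parse_text(xml_bytes):
    """Collect all character data from an XML document given as bytes."""
    chunks = []
    parser = xml.parsers.expat.ParserCreate()
    # expat may deliver text in several pieces, so gather and join them.
    parser.CharacterDataHandler = chunks.append
    parser.Parse(xml_bytes, True)
    return "".join(chunks)


text = "caf\u00e9"

# Same content, two encodings: one declared as ISO-8859-1,
# one with no declaration (so the parser assumes UTF-8).
latin1_doc = (
    '<?xml version="1.0" encoding="ISO-8859-1"?><w>' + text + "</w>"
).encode("iso-8859-1")
utf8_doc = ("<w>" + text + "</w>").encode("utf-8")

from_latin1 = parse_text(latin1_doc)
from_utf8 = parse_text(utf8_doc)

print(from_latin1 == from_utf8 == text)
```

The input bytes differ between the two documents, but the strings delivered to the handler do not.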

    And because the environment in which the XML is used often does not support UTF-8, but rather Latin-1 or Shift-JIS or whatever, it is often very important (and painful!) to convert all strings back to their original encoding.
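A minimal sketch of that conversion step, again in Python and under stated assumptions: the function name `to_target_encoding` is hypothetical, and falling back to numeric character references for unrepresentable characters is one possible policy, suitable only where the output is still XML or HTML text content (it is not valid in element names, for instance).

```python
def to_target_encoding(text, encoding):
    """Encode parser output for an environment that wants `encoding`.

    Characters the target encoding cannot represent become numeric
    character references (&#NNN;) instead of raising an error.
    """
    return text.encode(encoding, errors="xmlcharrefreplace")


# Representable characters are converted directly.
latin1_out = to_target_encoding("caf\u00e9", "iso-8859-1")
sjis_out = to_target_encoding("\u65e5\u672c", "shift_jis")

# Unrepresentable characters fall back to character references.
ascii_out = to_target_encoding("caf\u00e9 \u65e5", "ascii")

print(latin1_out, sjis_out, ascii_out)
```

Part of the pain mentioned above is that this fallback has to be chosen per environment; a plain-text consumer might need `errors="replace"` or a hard failure instead.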