in reply to XML Simple Charset Q?

You have to tell the XML parser used by XML::Simple that your data is in ISO-8859-1 (that's latin1 for the rest of us). Otherwise the parser assumes UTF-8, and your data is NOT well-formed XML.

Add this XML declaration at the top of your XML file:

<?xml version="1.0" encoding="ISO-8859-1"?>

But don't think that's enough... the parser (expat) will convert your data to utf8, so when you output it you might want to convert it back to latin1. Look at Unicode and locales for a recent thread on the subject.
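
For the convert-back step, here is a minimal sketch, assuming Perl 5.8 or later where the core Encode module is available. The sample string is made up and stands in for a value handed back by the parser; nothing here is specific to XML::Simple. If your strings are already decoded characters rather than raw UTF-8 bytes, skip the decode step.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical stand-in for parser output: the UTF-8 bytes for "café".
my $utf8_bytes = "caf\xC3\xA9";

# Decode the UTF-8 bytes into Perl characters, then re-encode them
# as single-byte ISO-8859-1 for output.
my $chars  = decode('UTF-8', $utf8_bytes);
my $latin1 = encode('ISO-8859-1', $chars);

# Show the resulting bytes in hex.
print join(' ', map { sprintf('%02X', ord($_)) } split //, $latin1), "\n";   # 63 61 66 E9
```

By default, characters with no Latin-1 equivalent are replaced with a substitution character, so check your data if it may contain anything outside that range.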

Re: Re: XML Simple Charset Q?
by dingus (Friar) on Nov 25, 2002 at 18:16 UTC
    The problem here is that I'm trying to process many such snippets of XML for output as HTML. I suspect it's easier in this case to do a substitution regex with the /e modifier instead of going through all the munging back from UTF-8.

    s/([\x80-\xff])/'&#'.ord($1).';'/eg
    appears to work for all the characters I care about.
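
    A minimal sketch of that substitution on its own, run over a made-up Latin-1 byte string (the sample text is an assumption, not data from this thread):

    ```perl
    use strict;
    use warnings;

    # Hypothetical input: Latin-1 bytes, with 0xE9 (e-acute) and 0xEF (i-diaeresis).
    my $text = "caf\xE9 na\xEFve";

    # Replace every byte in the range 0x80-0xFF with a numeric character reference.
    (my $html = $text) =~ s/([\x80-\xff])/'&#'.ord($1).';'/eg;

    print "$html\n";   # caf&#233; na&#239;ve
    ```

    The result is pure ASCII, so it reads the same whichever of the two encodings the page is served in.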

    Update: XML::Parser still insists on converting &#NNN; to UTF-8! I didn't notice at first, as Mozilla cunningly noted the changed page encoding and displayed it automagically as UTF-8. Mutter Mutter Curse Curse - this is a major pain as I'd like the page to remain Latin-1.

    Dingus


    Enter any 47-digit prime number to continue.
      Since the codes for Latin-1 are the same as Unicode for the first 256 values, that should work (you need to re-encode the values but don't need to translate them through a table). That is, if "use utf8" is not in scope when the regex is compiled. I don't know about Perl 5.8, which reportedly doesn't need the utf8 pragma; you might need some other way to refer to those characters on the input.

      Anyway, you can use the same lightweight trick to convert back: s/([\x{80}-\x{ff}])/pack('C',ord($1))/eg compiled with utf8 in effect (note the curlies on the \x codes; they indicate UTF-8 encoded characters). Use pack instead of chr so you can specify bytes (chr does too much DWIMery, and the persuasion thing is not as transparent as one would hope when dealing with I/O, though I think its behavior in 5.6 would work in this case).
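
      On Perl 5.8 and later, the same pack idea can be sketched without the substitution. This is only a sketch and not the 5.6 recipe itself: the helper name and sample string are made up, and croaking on code points above 255 is my own assumption about what you'd want.

      ```perl
      use strict;
      use warnings;
      use Carp qw(croak);
      use Encode qw(decode);

      # Hypothetical helper: turn a character string into Latin-1 bytes,
      # one byte per character, by packing each code point with 'C'.
      sub chars_to_latin1 {
          my ($chars) = @_;
          my @codes = map { ord($_) } split //, $chars;
          for my $code (@codes) {
              croak sprintf('U+%04X does not fit in Latin-1', $code) if $code > 0xFF;
          }
          return pack('C*', @codes);
      }

      my $chars = decode('UTF-8', "caf\xC3\xA9");   # 4 characters
      my $bytes = chars_to_latin1($chars);          # 4 bytes, the last one 0xE9
      ```

      This leans on the fact that the code point and the Latin-1 byte value are the same number for everything below 256.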

      —John

        er, it's important to note that though the ISO-8859-1 codepoints are the same as Unicode (below 256), the encodings are not the same (values above 127 are encoded multi-byte in utf-8, but values in ISO-8859-1 are always encoded single-byte).

        I bet most people in this conversation know this, but it's a bit important to clarify. For me, it wasn't so long ago that I didn't know the difference between codepoint and encoding.
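
        To make the difference concrete, here is a small sketch assuming Perl 5.8 or later with the core Encode module. It takes one code point, U+00E9, and encodes it both ways; the variable names are just for illustration.

        ```perl
        use strict;
        use warnings;
        use Encode qw(encode);

        # One character, code point U+00E9 (e with acute accent).
        my $char = chr(0xE9);

        # Same code point, two different encodings.
        my $latin1 = encode('ISO-8859-1', $char);
        my $utf8   = encode('UTF-8', $char);

        # Print each encoded form as hex bytes.
        for my $pair (['ISO-8859-1', $latin1], ['UTF-8', $utf8]) {
            my ($name, $bytes) = @$pair;
            printf "%-10s %s\n", $name,
                join(' ', map { sprintf('%02X', ord($_)) } split //, $bytes);
        }
        # ISO-8859-1 E9
        # UTF-8      C3 A9
        ```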

        Links to that subject: