Re: XML Simple Charset Q?
by mirod (Canon) on Nov 25, 2002 at 17:44 UTC
You have to tell the XML parser used by XML::Simple that your data is in ISO-8859-1 (that's latin1 for the rest of us), otherwise your data is NOT XML.
Add this XML declaration at the top of your XML file:
<?xml version="1.0" encoding="ISO-8859-1"?>
But don't think that's enough... the parser (expat) will convert your data to utf8, so when you output it you might want to convert it back to latin1. Look at Unicode and locales for a recent thread on the subject.
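A minimal sketch of the "convert it back to latin1" step, assuming Perl 5.8 or later with the core Encode module. The helper name `to_latin1_html` and the sample string are made up for illustration; the string stands in for whatever value the parser hands back. Characters that have no Latin-1 equivalent are written as `&#NNN;` references rather than dropped, which suits HTML output.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: turn a Perl character string (as returned by the
# parser) into ISO-8859-1 bytes. Anything outside Latin-1 becomes a
# decimal numeric character reference instead of being lost.
sub to_latin1_html {
    my ($text) = @_;    # work on a copy; encode() with a CHECK value may alter its input
    return encode('ISO-8859-1', $text, Encode::FB_HTMLCREF());
}

# Sample data (an assumption): u-umlaut is in Latin-1, the euro sign is not.
my $bytes = to_latin1_html("M\x{fc}ller \x{20ac}");

print length($bytes), " bytes\n";
```

The u-umlaut comes out as the single byte 0xFC, while the euro sign is emitted as `&#8364;`.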
| [reply] [d/l] |
The problem here is I'm trying to process many such snippets of XML for output as HTML. I suspect it's easier in this case to do a substitution regex with the /e modifier instead of going through all the munging back from UTF-8.
s/([\x80-\xff])/'&#'.ord($1).';'/eg appears to work for all the characters I care about.
Update: XML::Parser still insists on converting &#NNN; to UTF-8! I didn't notice because Mozilla cunningly detected the changed page encoding and automagically displayed it as UTF-8. Mutter Mutter Curse Curse - this is a major pain, as I'd like the page to remain Latin-1.
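For reference, the substitution above wrapped in a small function. This is only a sketch: the name `latin1_to_entities` and the sample input are invented for illustration, and it assumes the input is a string of Latin-1 bytes. As the update says, XML::Parser resolves the references back into characters, so this is of use on text headed for the HTML page, not as a way to keep the parser from producing UTF-8.

```perl
use strict;
use warnings;

# Hypothetical helper: replace every byte in the range 0x80-0xFF with a
# decimal numeric character reference; plain ASCII is left untouched.
sub latin1_to_entities {
    my ($bytes) = @_;
    $bytes =~ s/([\x80-\xff])/'&#' . ord($1) . ';'/eg;
    return $bytes;
}

# Sample input (an assumption): 0xF6 is o-umlaut in Latin-1.
my $html = latin1_to_entities("K\xf6ln");
print "$html\n";
```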
Dingus Enter any 47-digit prime number to continue.
| [reply] [d/l] [select] |
Since the codes for Latin-1 are the same as Unicode for the first 256 values, that should work (you need to re-encode the values but don't need to translate them through a table). That is, if "use utf8" is not in scope when the regex is compiled. I don't know about Perl 5.8, which reportedly doesn't need the utf8 pragma; you might need some other way to refer to those characters on the input.
Anyway, you can use the same lightweight trick to convert back: s/([\x{80}-\x{ff}])/pack('C',ord($1))/eg compiled with utf8 in effect (note the curlies on the \x codes, which indicate UTF-8 encoded characters). Use pack instead of chr so you can specify bytes; pack's 'C' template takes a number, hence the ord. (chr does too much DWIMery and the persuasion thing is not as transparent as one would hope when dealing with I/O, though I think its behavior in 5.6 would work in this case.)
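The point about the first 256 values can be checked with a short script. This is a sketch assuming Perl 5.8 or later with the core Encode module, not the 5.6 setup discussed above: each Latin-1 byte decodes to the character with the same code point, but that character takes two bytes once written as UTF-8, which is why the values need re-encoding but no lookup table.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

my $mismatches = 0;
for my $code (0x80 .. 0xff) {
    my $char = decode('ISO-8859-1', chr($code));   # Latin-1 byte -> character
    my $utf8 = encode('UTF-8', $char);             # character -> UTF-8 bytes

    $mismatches++ unless ord($char) == $code;      # code point equals the byte value
    $mismatches++ unless length($utf8) == 2;       # but UTF-8 needs two bytes for it
}

print "mismatches: $mismatches\n";
```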
—John
| [reply] [d/l] |
Re: XML Simple Charset Q?
by mirod (Canon) on Nov 25, 2002 at 18:53 UTC
OK, so first your version of XML::Parser is _OLD_. Keep it only if you are on Windows (PPM depends on it and it might not be wise to change it).
Then search the site for ways to convert from utf8; there are plenty that work (Encode with 5.8.0, Text::Iconv if you have it, Unicode::* ...).
Then (sorry grantm, I did not want to push XML::Twig but they are forcing me to ;--) you can always use XML::Twig with the keep_encoding option that will keep the data in its original encoding.
| [reply] |
Re: XML Simple Charset Q?
by pg (Canon) on Nov 25, 2002 at 18:17 UTC
Yes, you can use umlauts in your XML, and XML::Parser is okay with them. Just do two things:
- When you create your XML::Parser object, specify
ProtocolEncoding => "Latin-1"
- If you don't have a file called Latin-1.enc under your XML/Parser/Encodings directory, get it from somewhere or make one for yourself. If you already have it, you are ready to go now.
| [reply] [d/l] |
ProtocolEncoding
This is an Expat option. This sets the protocol encoding name.
It defaults to none. The built-in encodings are: "UTF-8",
"ISO-8859-1", "UTF-16", and "US-ASCII". Other encodings may be
used if they have encoding maps in one of the directories in
the @Encoding_Path list. Check the section on "ENCODINGS" for
more information on encoding maps. Setting the protocol encoding
overrides any encoding in the XML declaration.
| [reply] |
Please, please, please do not use the ProtocolEncoding option. As mirod said, if your source XML document a) does not declare an encoding and b) is not UTF-8 (or UTF-16) encoded, then it is not XML! The two preferred options are:
- If you are generating the XML, then you need to include an XML declaration which specifies the encoding.
- If the XML is being generated by someone else, then you need to reject it since it is not well-formed.
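For the first option, a sketch of adding the declaration at the point where the XML is generated. The function name `ensure_xml_declaration` is hypothetical, and the approach only makes sense when the generating code actually knows which encoding it wrote; it is not a way to guess on someone else's behalf.

```perl
use strict;
use warnings;

# Hypothetical helper: prepend an XML declaration naming the encoding,
# unless the document already starts with one. It does not inspect or
# change an existing declaration.
sub ensure_xml_declaration {
    my ($xml, $encoding) = @_;
    return $xml if $xml =~ /\A<\?xml\s/;
    return qq{<?xml version="1.0" encoding="$encoding"?>\n} . $xml;
}

my $doc = ensure_xml_declaration("<rec id='F600' type='J'/>", 'ISO-8859-1');
print "$doc\n";
```

Calling it a second time on its own output leaves the document unchanged.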
Sure, you might guess that the encoding is ISO-8859-1 and it might seem to work if you force it with ProtocolEncoding, but the encoding might actually be CP1252 and the differences haven't tripped you up - yet.
The encodings section of the Perl XML FAQ may be useful.
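To make the CP1252 risk concrete, here is a small sketch using the core Encode module (Perl 5.8 or later). The same byte decodes to different characters depending on which encoding you assume: 0x80 is a control character in ISO-8859-1 but the euro sign in CP1252. The variable names are illustrative only.

```perl
use strict;
use warnings;
use Encode qw(decode);

my $byte = "\x80";

my $as_latin1 = decode('ISO-8859-1', $byte);   # a C1 control character
my $as_cp1252 = decode('cp1252',     $byte);   # the euro sign

printf "ISO-8859-1: U+%04X\n", ord $as_latin1;
printf "CP1252:     U+%04X\n", ord $as_cp1252;
```

Data containing only bytes from 0xA0 upwards decodes identically under both, which is how a wrong guess can go unnoticed for a long time.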
| [reply] |
Any advice on where to find these protocol/encoding sections, or how they should look?
I spend a lot of time tacking on the headers as suggested earlier in the thread, and I'd like to learn a little more about how expat and XML::Parser deal with encodings -- specifically, how they're mapped.
Suggestions?
| [reply] |
OT: Re: XML Simple Charset Q?
by talexb (Chancellor) on Nov 26, 2002 at 15:36 UTC
This is perhaps off-topic, but I was wondering why your XML is not as follows:
<rec id = 'F600' type = 'J'>
<author>A. S. Bommarius</author>
<author>K. Drauz</author>
<author>W. Hummel</author>
<author>M.-R. Kula</author>
<author>C. Wandrey</author>
(snippage)
</rec>
--t. alex
but my friends call me T.
| [reply] [d/l] |