XML Simple Charset Q?

dingus has asked for the wisdom of the Perl Monks concerning the following question:

Re: XML Simple Charset Q? by mirod (Canon) on Nov 25, 2002 at 17:44 UTC
You have to tell the XML parser used by XML::Simple that your data is in ISO-8859-1 (that's latin1 for the rest of us); otherwise your data is NOT XML. Add this XML declaration at the top of your XML file: `<?xml version="1.0" encoding="ISO-8859-1"?>` But don't think that's enough... the parser (`expat`) will convert your data to UTF-8, so when you output it you might want to convert it back to latin1. Look at Unicode and locales for a recent thread on the subject.
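A minimal sketch of the "convert it back to latin1" step, assuming Perl 5.8 or later with the core Encode module. The helper name `to_latin1` and the sample string are made up for illustration; they are not from the original post.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: turn a Perl character string (such as the text the
# parser hands back) into ISO-8859-1 bytes ready for output.
sub to_latin1 {
    my ($text) = @_;
    # FB_CROAK makes encode() die on characters Latin-1 cannot represent;
    # LEAVE_SRC keeps encode() from modifying $text in place.
    return encode('ISO-8859-1', $text, Encode::FB_CROAK | Encode::LEAVE_SRC);
}

# Simulate what comes out of the parser: UTF-8 bytes decoded to characters.
my $utf8_bytes = "M\xC3\xBCller";
my $chars      = decode('UTF-8', $utf8_bytes);
my $latin1     = to_latin1($chars);

printf "%d UTF-8 bytes in, %d Latin-1 bytes out\n",
    length($utf8_bytes), length($latin1);
```

Dying on unmappable characters is a design choice for the sketch; substituting a replacement character instead may suit some output better.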
Re: Re: XML Simple Charset Q? by dingus (Friar) on Nov 25, 2002 at 18:16 UTC
The problem here is that I'm trying to process many such snippets of XML for output as HTML. I suspect it's easier in this case to do a substitution regex with the /e modifier instead of going through all the munging back from UTF-8. `s/([\x80-\xff])/'&#'.ord($1).';'/eg` appears to work for all the characters I care about. Update: XML::Parser still insists on converting `&#NNN;` to UTF-8! I didn't notice at first, as Mozilla cunningly noted the changed page encoding and automagically displayed the page as UTF-8. Mutter Mutter Curse Curse - this is a major pain as I'd like the page to remain Latin-1. Dingus `Enter any 47-digit prime number to continue.`
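For reference, the substitution above can be wrapped and tried on a sample string. This is only a sketch: the helper name `latin1_to_ncr` and the test string are invented here, and it assumes the input is a string of Latin-1 bytes.

```perl
use strict;
use warnings;

# Hypothetical helper: replace each high-bit Latin-1 byte with a numeric
# character reference, leaving plain ASCII untouched.
sub latin1_to_ncr {
    my ($bytes) = @_;
    $bytes =~ s/([\x80-\xff])/'&#' . ord($1) . ';'/eg;
    return $bytes;
}

print latin1_to_ncr("K\xF6ln"), "\n";   # prints K&#246;ln
```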
Re: Re: Re: XML Simple Charset Q? by John M. Dlugosz (Monsignor) on Nov 25, 2002 at 19:52 UTC
Since the codes for Latin-1 are the same as Unicode for the first 256 values, that should work (you need to re-encode the values but don't need to translate them through a table). That is, if "use utf8" is not in scope when the regex is compiled. I don't know about Perl 5.8, which reportedly doesn't need the utf8 pragma; you might need some other way to refer to those characters on the input. Anyway, you can use the same lightweight trick to convert back: `s/([\x{80}-\x{ff}])/pack('C',ord($1))/eg` compiled with utf8 in effect (note the curlies on the \x codes; they indicate UTF-8 encoded characters). Use pack instead of chr so you can specify bytes (chr does too much DWIMery, and the persuasion thing is not as transparent as one would hope when dealing with I/O, though I think its behavior in 5.6 would work in this case). --John
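A rough sketch of the pack-based idea for Perl 5.8 and later, relying on the fact that code points 0 to 255 are identical in Latin-1 and Unicode. The helper name `chars_to_latin1_bytes` is hypothetical, and it handles the whole string at once instead of using a substitution.

```perl
use strict;
use warnings;
use Carp qw(croak);

# Hypothetical helper: write every character of $text as one unsigned byte.
# This only works because Latin-1 and Unicode agree on the first 256 values.
sub chars_to_latin1_bytes {
    my ($text) = @_;
    croak "character above U+00FF does not fit in one Latin-1 byte"
        if $text =~ /[^\x{00}-\x{ff}]/;
    # ord() yields the code point; pack 'C*' emits each one as a single byte.
    return pack('C*', map { ord($_) } split(//, $text));
}

my $bytes = chars_to_latin1_bytes("K\x{F6}ln");
printf "%d bytes\n", length($bytes);
```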
Re: Re: Re: Re: XML Simple Charset Q? by jkahn (Friar) on Nov 25, 2002 at 20:03 UTC
Re: Re: Re: Re: Re: XML Simple Charset Q? by John M. Dlugosz (Monsignor) on Nov 25, 2002 at 21:59 UTC
Re: XML Simple Charset Q? by mirod (Canon) on Nov 25, 2002 at 18:53 UTC
OK, so first your version of XML::Parser is _OLD_. Keep it only if you are on Windows (PPM depends on it and it might not be wise to change it). Then search the site for ways to convert from UTF-8; there are plenty that work (Encode with 5.8.0, Text::Iconv if you have it, Unicode::* ...). Then (sorry grantm, I did not want to push XML::Twig but they are forcing me to ;--) you can always use XML::Twig with the `keep_encoding` option, which keeps the data in its original encoding.
Re: XML Simple Charset Q? by pg (Canon) on Nov 25, 2002 at 18:17 UTC
Yes, you can use umlauts in your XML, and XML::Parser is okay with them. Just do two things: 1. When you create your XML::Parser object, specify `ProtocolEncoding => "Latin-1"`. 2. If you don't have a file called Latin-1.enc under your XML/Parser/Encodings directory, get it from somewhere or make one for yourself. If you already have it, you are ready to go now.
Re: Re: XML Simple Charset Q? by mirod (Canon) on Nov 25, 2002 at 18:32 UTC
"If you don't have a file called Latin-1.enc under your XML/Parser/Encodings directory, get it from somewhere or make one for yourself. If you already have it, you are ready to go now."

Actually there is no such file in the `Encodings` directory and there is no need for one. ISO-8859-1 is understood by `expat` natively. From the XML::Parser doc:

"ProtocolEncoding: This is an Expat option. This sets the protocol encoding name. It defaults to none. The built-in encodings are: "UTF-8", "ISO-8859-1", "UTF-16", and "US-ASCII". Other encodings may be used if they have encoding maps in one of the directories in the @Encoding_Path list. Check the section on "ENCODINGS" for more information on encoding maps. Setting the protocol encoding overrides any encoding in the XML declaration."
Re: Re: XML Simple Charset Q? by grantm (Parson) on Nov 25, 2002 at 22:25 UTC
Please, please, please do not use the ProtocolEncoding option. As mirod said, if your source XML document a) does not declare an encoding and b) is not UTF-8 (or UTF-16) encoded, then it is not XML! The two preferred options are:

1. If you are generating the XML, then you need to include an XML declaration which specifies the encoding.
2. If the XML is being generated by someone else, then you need to reject it since it is not well formed.

Sure, you might guess that the encoding is ISO-8859-1 and it might seem to work if you force it with ProtocolEncoding, but the encoding might actually be CP1252 and the differences haven't tripped you up - yet. The encodings section of the Perl XML FAQ may be useful.
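To illustrate how ISO-8859-1 and CP1252 can agree on the characters seen so far and still differ, here is a small sketch using the core Encode module. The helper name `code_point_for` and the sample bytes are chosen for illustration only.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: report the code point that a single byte maps to
# under the named encoding.
sub code_point_for {
    my ($encoding, $byte) = @_;
    return ord(decode($encoding, $byte));
}

# 0xE9 (e-acute) is the same in both encodings, so a wrong guess goes
# unnoticed; 0x80 is a control character in ISO-8859-1 but the euro sign
# in CP1252.
for my $byte ("\xE9", "\x80") {
    printf "byte 0x%02X: ISO-8859-1 => U+%04X, cp1252 => U+%04X\n",
        ord($byte),
        code_point_for('ISO-8859-1', $byte),
        code_point_for('cp1252',     $byte);
}
```

The differences are confined to the 0x80-0x9F range, which is why a wrong guess can go unnoticed for a long time.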
Re: Re: XML Simple Charset Q? by dingus (Friar) on Nov 25, 2002 at 18:29 UTC
1. Where the heck do I find a latin-1.enc file? Google is not my friend right now :( 2. Does this end up with UTF-8 output anyway? - see my update to my reply to mirod above.
Re: Re: XML Simple Charset Q? by jkahn (Friar) on Nov 25, 2002 at 18:31 UTC
Any advice on where to find these protocol/encoding sections, or how they should look? I spend a lot of time tacking on the headers as suggested earlier in the thread, and I'd like to learn a little more about how `expat` and XML::Parser deal with encodings -- specifically, how they're mapped. Suggestions?
OT: Re: XML Simple Charset Q? by talexb (Chancellor) on Nov 26, 2002 at 15:36 UTC
This is perhaps off-topic, but I was wondering why your XML is not as follows: `<rec id = 'F600' type = 'J'> <author>A. S. Bommarius</author> <author>K. Drauz</author> <author>W. Hummel</author> <author>M.-R. Kula</author> <author>C. Wandrey</author> (snippage) </rec>` --t. alex but my friends call me T.
Re: OT: Re: XML Simple Charset Q? by dingus (Friar) on Nov 26, 2002 at 16:00 UTC
Because it's output from EndNote and I'd have to go and split the single author field that I get. Since, for the application I'm writing, we don't want to sort by author, just search on and display the author list, I can't be bothered to split the field up and then have to reintegrate it for the display. (It's a good question though - and I have thought about it. If I get my other XML entity question sorted I may revisit this, as there could be some advantage if I did this and used XML::Twig.)