in reply to How to check the encoding format of an XML
In practice I've had "XML" files that claim to be in one encoding, but they turn out to be in another.
I hate all this pseudo-XML. XML was rigid in what it accepted from the start, and for a reason: to force people to produce valid XML. But more and more I see this being watered down: people who claim to produce XML, but whose XML exporting program contains bugs, so that their file only superficially looks like XML. And more and more, they're getting away with it. Argh!
If the XML is valid and the encoding it declares is the one it actually uses, you don't have to worry: the XML parser will process it properly and transcode the character set for you. But it's becoming more and more common that you have to fix the file before it becomes parseable. In that case, you'll have to check how likely each candidate encoding is.
At first I'd second Corion's suggestion of using Encode::Guess, but on second look, and after scanning through the docs, I think the problems you're likely to encounter in practice are usually too subtle for this module to catch. Very often you get ISO-8859 related encodings, single-byte character sets that extend ASCII, and the data contains characters that are not in the indicated character set. A typical example: they claim the character set is ISO-Latin-1, while the file contains bytes that are only used in CP-1252 (AKA Windows Latin-1). CP-1252 is a superset of ISO-Latin-1 as far as printable characters go: it puts printable characters in the 0x80-0x9F range, where ISO-8859-1 only has control codes.
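A minimal sketch of that kind of byte check, in Python purely for illustration. The function name and the sample data are made up here; the only fact it relies on is that bytes 0x80-0x9F are control codes in ISO-8859-1 but mostly printable characters in CP-1252.

```python
import re

# Bytes 0x80-0x9F are C1 control codes in ISO-8859-1, but mostly printable
# characters (curly quotes, dashes, the Euro sign) in CP-1252.
C1_RANGE = re.compile(rb"[\x80-\x9f]")

def suspicious_latin1_bytes(data):
    """Return the sorted, distinct byte values found in the 0x80-0x9F range.

    Hypothetical helper: a non-empty result means a file declared as
    ISO-8859-1 is more likely CP-1252 (or something else entirely).
    """
    return sorted({match.group()[0] for match in C1_RANGE.finditer(data)})

# Made-up sample: declared as ISO-8859-1, but uses CP-1252 quotes and Euro.
sample = (b'<?xml version="1.0" encoding="ISO-8859-1"?>'
          b'<p>\x93quoted\x94 \x80 5</p>')
found = suspicious_latin1_bytes(sample)
print([hex(b) for b in found])
```

An empty list doesn't prove the file really is ISO-8859-1; it only means this particular symptom is absent.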
So, you're more or less forced to check what bytes the file contains, and see what character set they're most likely a part of. It's usually safe to replace ISO-Latin-1 with CP-1252. But if you find you end up with words/strings that are not properly decoded, you'll have to tweak that guess.
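As a rough sketch of that substitution (Python for illustration, with a hypothetical function name): decode data that claims to be ISO-Latin-1 as CP-1252 first, and fall back to plain ISO-8859-1 only if that fails. Python's `cp1252` codec leaves a handful of byte values undefined and raises an error on them, which is what triggers the fallback.

```python
def decode_claimed_latin1(data):
    """Decode bytes that were declared as ISO-8859-1.

    Hypothetical helper: tries CP-1252 first, since that is the usual
    culprit, and falls back to ISO-8859-1, which accepts every byte value.
    Returns a (text, encoding_used) pair.
    """
    try:
        return data.decode("cp1252"), "cp1252"
    except UnicodeDecodeError:
        # A byte that CP-1252 leaves undefined (e.g. 0x81) was present.
        return data.decode("iso-8859-1"), "iso-8859-1"

text, used = decode_claimed_latin1(b"caf\xe9 \x93hi\x94")
print(used, text)
```

If the decoded strings still look wrong, that's the sign to tweak the guess rather than trust this fallback.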
In the generic case, you could apply heuristic guesses: in real-world text files, a Euro symbol ("€") is more likely to occur than a "y" with a diaeresis ("ÿ"), for example.
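One way such a heuristic could be sketched, again in Python for illustration: decode the same bytes with each candidate encoding and score the result by how plausible its characters are. The weights, the function names and the sample are all assumptions made up for this example, not measured frequencies. The sample uses byte 0xA4, which is the generic currency sign in ISO-8859-1 and the Euro sign in ISO-8859-15.

```python
# Illustrative weights only (assumed, not measured): characters that are
# common in real text score positive, rare ones score negative.
PLAUSIBILITY = {
    "\u20ac": 2,   # Euro sign
    "\u00a4": -2,  # generic currency sign, rare in real text
    "\u00ff": -1,  # y with diaeresis
}

def score(text):
    """Sum the plausibility weights of all characters in the text."""
    return sum(PLAUSIBILITY.get(ch, 0) for ch in text)

def best_guess(data, candidates):
    """Return the candidate encoding whose decoding scores highest.

    Candidates that cannot decode the data are skipped; on a tie the
    earlier candidate wins. Returns None if nothing decodes.
    """
    scored = []
    for name in candidates:
        try:
            scored.append((score(data.decode(name)), name))
        except UnicodeDecodeError:
            continue
    if not scored:
        return None
    return max(scored, key=lambda pair: pair[0])[1]

guess = best_guess(b"Total: 10 \xa4", ["iso-8859-1", "iso-8859-15"])
print(guess)
```

Putting the declared encoding first in the candidate list means it is kept unless another candidate scores strictly better.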
At least XML sources are fairly consistent: if one of their files is actually in ISO-8859-15 instead of ISO-8859-1, it's safe to assume all their files use the same encoding. So it's not absolutely necessary to apply the heuristics to every single one of their files, especially as long as they're produced by the same program.
Re^2: How to check the encoding format of an XML
by Anonymous Monk on Apr 15, 2010 at 09:23 UTC
by bart (Canon) on May 02, 2010 at 21:09 UTC
by ikegami (Patriarch) on May 03, 2010 at 00:12 UTC
by Anonymous Monk on May 03, 2010 at 09:00 UTC