in reply to How to check the encoding format of an XML

In theory XML files without indication of encoding are encoded as UTF-8 by default.

In practice I've had "XML" files that claim to be in one encoding, but they turn out to be in another.
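A first sanity check is to compare what the file says about itself with what its first bytes suggest. The following is only a minimal sketch: `sniff_xml_encoding` is a hypothetical helper name, it looks only at UTF-8 and UTF-16 byte order marks (UTF-32 is ignored), and it reads the `encoding` pseudo-attribute with a simple regex rather than a real parser.

```perl
use strict;
use warnings;

# Hypothetical helper: inspect the first bytes of an XML document and return
# (encoding implied by a BOM, encoding named in the XML declaration).
# Either value may be undef.
sub sniff_xml_encoding {
    my ($bytes) = @_;

    my $bom;
    if    ($bytes =~ /\A\xEF\xBB\xBF/) { $bom = 'UTF-8' }
    elsif ($bytes =~ /\A\xFF\xFE/)     { $bom = 'UTF-16LE' }
    elsif ($bytes =~ /\A\xFE\xFF/)     { $bom = 'UTF-16BE' }

    # Drop NUL bytes so that a UTF-16 declaration can be matched as if it were ASCII.
    (my $head = substr($bytes, 0, 200)) =~ tr/\0//d;
    my ($declared) =
        $head =~ /<\?xml[^>]*?\bencoding\s*=\s*["']([A-Za-z][A-Za-z0-9._-]*)["']/;

    return ($bom, $declared);
}

# Example: a UTF-16LE file with a BOM and no declared encoding.
my ($bom, $declared) = sniff_xml_encoding("\xFF\xFE<\0?\0x\0m\0l\0 \0?\0>\0");
printf "BOM: %s, declared: %s\n", $bom // 'none', $declared // 'none';
```

If the two answers disagree, or the declared encoding fails to decode the file, you are in pseudo-XML territory.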

I hate all this pseudo-XML. XML was rigid in what it accepted from the start for a reason: to force people to produce valid XML. But more and more I see this being watered down: people who claim to produce XML, while their XML exporting program contains bugs and their file only superficially looks like XML. And more and more, they're getting away with it. Argh!

If the XML is valid, you don't have to worry: the XML parser will process it properly and transcode the character sets for you. But it's becoming more and more common that you'll have to fix it before it becomes parseable, and in that case you'll have to check how likely each encoding is. At first I'd second Corion's suggestion of using Encode::Guess, but on second look, after scanning through the docs, I think the problems you're likely to encounter in practice are usually too subtle for this module to catch. Very often you get ISO-8859 related encodings, single-byte character sets that extend ASCII, and what they give you contains characters that are not in the indicated character set. A typical example is a file that claims to be ISO-Latin-1 while it contains bytes in the range 0x80-0x9F, which are printable characters only in CP-1252 (AKA Windows Latin-1), a superset of ISO-Latin-1 as far as printable characters go.

So, you're more or less forced to check what bytes the file contains, and see what character set they're most likely a part of. It's usually safe to replace ISO-Latin-1 with CP-1252. But if you find you end up with words/strings that are not properly decoded, you'll have to tweak that guess.
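As a rough illustration of that byte check, here is a minimal sketch. The helper name `suggest_encoding` and the order of the tests are my own assumptions, not a standard recipe: it tries a strict UTF-8 decode first, then looks for bytes that are printable only in CP-1252.

```perl
use strict;
use warnings;
use Encode qw(decode FB_CROAK LEAVE_SRC);

# Hypothetical helper: suggest how to decode bytes that claim to be ISO-8859-1.
sub suggest_encoding {
    my ($bytes) = @_;

    # No high bytes at all: plain ASCII, so every candidate agrees.
    return 'ascii' unless $bytes =~ /[\x80-\xFF]/;

    # A strict UTF-8 decode dies on malformed sequences, so success is a strong hint.
    return 'UTF-8' if eval { decode('UTF-8', $bytes, FB_CROAK | LEAVE_SRC); 1 };

    # 0x80-0x9F are C1 control codes in ISO-8859-1 but mostly printable
    # characters (curly quotes, dashes, the Euro sign) in CP-1252.
    return 'cp1252' if $bytes =~ /[\x80-\x9F]/;

    return 'iso-8859-1';
}

print suggest_encoding("\x93quoted\x94"), "\n";   # curly quotes as CP-1252 bytes
```

This cannot tell ISO-8859-1 from ISO-8859-15 or other siblings, since they use the same byte range; for that you have to look at the decoded text.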

In the generic case, you could apply heuristic guesses: in real-world text files, a Euro symbol ("€") is more likely to occur than a "y" with diaeresis ("ÿ"), for example.
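One way to turn such a guess into code is to decode the same bytes under each candidate encoding and count how many of the resulting characters you consider plausible. This is only a sketch under my own assumptions: `best_encoding` is a hypothetical name, the "plausible" set is something you have to supply from knowledge of the data, and ties go to the first candidate listed.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical scorer: return the candidate encoding whose decoding of $bytes
# contains the most characters from the caller-supplied $plausible set.
sub best_encoding {
    my ($bytes, $plausible, @candidates) = @_;
    my ($best, $best_score) = (undef, -1);
    for my $enc (@candidates) {
        my $text  = decode($enc, $bytes);   # default mode: bad bytes become U+FFFD
        my $score = () = $text =~ /[\Q$plausible\E]/g;
        ($best, $best_score) = ($enc, $score) if $score > $best_score;
    }
    return $best;
}

# Assumption for illustration: the data holds prices, so a Euro sign (U+20AC)
# is far more plausible than the generic currency sign that byte 0xA4 is in ISO-8859-1.
my $sample = "Prix: 12 \xA4";
print best_encoding($sample, "\x{20AC}", 'iso-8859-1', 'iso-8859-15'), "\n";
```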

At least XML sources are fairly consistent: if one of their files is actually in ISO-8859-15 instead of ISO-8859-1, it's safe to assume all their files will use the same encoding. So it's not absolutely necessary to apply the heuristics to every single one of their files, especially as long as they're produced by the same program.

Re^2: How to check the encoding format of an XML
by Anonymous Monk on Apr 15, 2010 at 09:23 UTC

    Hi,

    Perhaps this is not the right forum to ask this (it depends how strict you are), but still I think it's useful for a perl programmer to know how to do this without perl.

    The question is: is there any unix/linux command that tells me the encoding format of an xml file?

    I've got xml files that don't claim any particular encoding (<?xml version="1.0" ?>). They are in UCS-2LE but I need to have them in UTF-8 or ANSI.

    This time I could see the encoding opening them in an editor but it would be much handier to check from the command line. The "file" command only tells me "XML document text"

    Cheers and thanks a lot!

    xinelo

      but I need to have them in UTF-8 or ANSI.
      Assuming the XML file is valid — and, since you posted this in a thread where I complained that people often produce invalid XML, that's not necessarily true — I think you can use XSLT, with an identity transform and thus make it produce XML in any encoding you like.
      The question is: is there any unix/linux command that tells me the encoding format of an xml file?
      Uh? Do you still need it, then? Anyway, if you don't mind a solution involving Perl, then Encode::Guess might do the trick.
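      A minimal sketch of what that could look like, assuming the default suspects of Encode::Guess (ASCII, UTF-8 and BOM-marked UTF-16/32) are enough for your files; `guessed_name` is just a hypothetical wrapper name:

      ```perl
      use strict;
      use warnings;
      use Encode::Guess;   # exports guess_encoding by default

      # Hypothetical wrapper: return the name of the guessed encoding,
      # or undef when the guess fails or is ambiguous.
      sub guessed_name {
          my ($bytes, @suspects) = @_;
          my $enc = guess_encoding($bytes, @suspects);
          # On success guess_encoding returns an encoding object,
          # on failure a plain error string.
          return ref $enc ? $enc->name : undef;
      }

      # Command-line variant (slurps the whole file, prints the name or the error):
      #   perl -MEncode::Guess -0777 -ne 'my $e = guess_encoding($_); print ref $e ? $e->name : $e, "\n"' file.xml
      print guessed_name(qq{<?xml version="1.0" ?><a/>}) // 'unknown', "\n";
      ```

      Don't expect it to tell the single-byte ISO-8859 family members apart: listing several of them as suspects usually just gets you an "ambiguous" answer.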

      They are in UCS-2LE but I need to have them in UTF-8 or ANSI.

      perl -pe' BEGIN { binmode STDIN, ":raw:perlio:encoding(UTF-16le)"; binmode STDOUT, ":raw:perlio:encoding(UTF-8)"; } ' < file.xml > file.utf8.xml

      (UTF-16le is a superset of UCS-2le, so it's safer to use it when decoding.)

      Update: Note that this doesn't fix the encoding= attribute of the <?xml?> declaration. But it sounds like you're trying to make the encoding match it anyway.

        Or using the open pragma
        perl '-Mopen=:std,IN,:raw:perlio:encoding(UTF-16le),OUT,:raw:perlio:encoding(UTF-8)' -pe 1 < file.xml > file.utf8.xml