In theory, an XML file with no encoding declaration (and no byte order mark) is encoded as UTF-8 by default.
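To make that rule concrete, here is a minimal sketch in Python (for illustration only) that reads the encoding a file claims in its XML declaration and falls back to UTF-8 when there is none. The function name `declared_encoding` is made up for this example, and the sketch assumes an ASCII-compatible encoding; it deliberately ignores UTF-16/UTF-32 input.

```python
import re

# Matches an XML declaration at the very start of the data and captures
# the value of its encoding pseudo-attribute, if there is one.
XML_DECL = re.compile(
    rb'^<\?xml[^>]*?encoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\']'
)

def declared_encoding(data: bytes) -> str:
    """Return the encoding named in the XML declaration, or UTF-8 if absent.

    Hypothetical helper: only handles ASCII-compatible input.
    """
    # Skip a UTF-8 byte order mark if present.
    if data.startswith(b"\xef\xbb\xbf"):
        data = data[3:]
    match = XML_DECL.match(data)
    return match.group(1).decode("ascii") if match else "UTF-8"
```

Note that this only tells you what the file claims, which is exactly the part that cannot be trusted.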

In practice I've had "XML" files that claim to be in one encoding, but they turn out to be in another.

I hate all this pseudo-XML. XML was rigid in what it accepted from the start for a reason: to force people to produce valid XML. But more and more I see this being watered down: people who claim to produce XML, while their XML exporting program contains bugs and their files only superficially look like XML. And more and more, they're getting away with it. Argh!

If the XML is valid, you don't have to worry: the XML parser will process it properly and transcode the character sets for you. But it's becoming more and more common that you have to fix the file before it becomes parseable, and in that case you'll have to judge how likely each encoding is. At first I'd second Corion's suggestion of using Encode::Guess, but on second look, after scanning through the docs, I think the problems you're likely to encounter in practice are usually too subtle for this module to catch. Very often you get ISO-8859 related encodings, single-byte character sets that extend ASCII, and the data contains characters that are not in the indicated character set. A typical example: the file claims to be ISO-Latin-1 while it contains bytes that are only used in CP-1252 (AKA Windows Latin-1), which is a superset of ISO-Latin-1.
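The ISO-Latin-1 versus CP-1252 case can be checked mechanically. The sketch below (Python, for illustration; the function names are made up) relies on the fact that bytes 0x80 to 0x9F are C1 control codes in ISO-8859-1 but mostly printable characters, such as curly quotes and the Euro sign, in CP-1252. It assumes the data is in one of those two encodings.

```python
import re

# Bytes 0x80-0x9F: control codes in ISO-8859-1, mostly printable in CP-1252.
C1_RANGE = re.compile(rb"[\x80-\x9f]")

def looks_like_cp1252(data: bytes) -> bool:
    """Return True if data contains bytes that are only plausible in CP-1252."""
    return C1_RANGE.search(data) is not None

def decode_claimed_latin1(data: bytes) -> str:
    """Decode data labelled ISO-8859-1, preferring CP-1252 when it fits."""
    if looks_like_cp1252(data):
        try:
            return data.decode("cp1252")
        except UnicodeDecodeError:
            # Python's cp1252 codec leaves 0x81, 0x8D, 0x8F, 0x90 and 0x9D
            # unassigned; fall through to plain ISO-8859-1 in that case.
            pass
    return data.decode("iso-8859-1")
```

For example, `decode_claimed_latin1(b"\x93hi\x94")` yields the text with proper curly quotes instead of two invisible control characters.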

So, you're more or less forced to check what bytes the file contains, and see what character set they're most likely a part of. It's usually safe to replace ISO-Latin-1 with CP-1252. But if you find you end up with words/strings that are not properly decoded, you'll have to tweak that guess.
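One way to do that check by eye is to list every non-ASCII byte in the file, with its count and what it would mean in each candidate character set. This is only a sketch in Python; the candidate list and the helper names are assumptions for the example, not something a particular module provides.

```python
from collections import Counter

# Hypothetical candidate list; adjust to the encodings you suspect.
CANDIDATES = ("iso-8859-1", "cp1252", "iso-8859-15")

def high_byte_counts(data: bytes) -> Counter:
    """Count every byte with the high bit set (i.e. every non-ASCII byte)."""
    return Counter(b for b in data if b >= 0x80)

def decode_byte(value: int, encoding: str):
    """Decode a single byte, or return None if the encoding leaves it unassigned."""
    try:
        return bytes([value]).decode(encoding)
    except UnicodeDecodeError:
        return None

def byte_report(data: bytes, candidates=CANDIDATES) -> dict:
    """Map each high byte to (count, {encoding: decoded character or None})."""
    report = {}
    for value, count in sorted(high_byte_counts(data).items()):
        decoded = {enc: decode_byte(value, enc) for enc in candidates}
        report[value] = (count, decoded)
    return report
```

Printing the report for a suspicious file, for instance with `for value, (count, decoded) in byte_report(data).items(): print(hex(value), count, decoded)`, usually makes the real encoding obvious after a glance at a few words.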

In the generic case, you could apply heuristic guesses: in real-world text files, a Euro symbol ("€") is more likely to occur than a dotted "y" ("ÿ"), for example.
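Such a heuristic could be sketched as a simple penalty score: decode the data with each candidate and prefer the result with the fewest implausible characters. Everything here is an assumption made for illustration (the `UNLIKELY` set, the scoring, the function names); a real list of unlikely characters would have to be tuned to your data.

```python
import unicodedata

# Hypothetical set of characters that rarely occur in ordinary text.
UNLIKELY = set("\u00a4\u00ff")  # generic currency sign, y with diaeresis

def implausibility(text: str) -> int:
    """Count control characters (except common whitespace) and unlikely characters."""
    score = 0
    for ch in text:
        if unicodedata.category(ch) == "Cc" and ch not in "\t\n\r":
            score += 1
        elif ch in UNLIKELY:
            score += 1
    return score

def best_guess(data: bytes, candidates):
    """Return the candidate encoding with the lowest score, or None if none decodes."""
    scored = []
    for enc in candidates:
        try:
            text = data.decode(enc)
        except UnicodeDecodeError:
            continue
        scored.append((implausibility(text), enc))
    # min() returns the first minimum, so earlier candidates win ties.
    return min(scored, key=lambda pair: pair[0])[1] if scored else None
```

With the candidates `("iso-8859-1", "iso-8859-15")`, a price such as `b"100 \xa4"` scores better as ISO-8859-15, where byte 0xA4 is the Euro sign rather than the generic currency sign.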

At least XML sources are fairly consistent: if one of their files is actually in ISO-8859-15 instead of ISO-8859-1, it's safe to assume all their files will use the same encoding. So it's not absolutely necessary to apply the heuristics to every single one of their files, especially as long as they're produced by the same program.

In reply to Re: How to check the encoding format of an XML by bart
in thread How to check the encoding format of an XML by rellaboyina