Re: Check UTF8

There isn't enough information to write a program that does so. Files are just streams of bytes. And while many byte streams can be determined not to be valid UTF-8, the reverse isn't true: a byte stream that happens to be valid UTF-8 need not have been written as UTF-8. For instance, if you have a line in the file with the bytes E2 A1 B9, is that a line with the three characters LATIN SMALL LETTER A WITH CIRCUMFLEX, INVERTED EXCLAMATION MARK, SUPERSCRIPT ONE (â¡¹ in Latin-1), or the single character BRAILLE PATTERN DOTS-14567 (⡹ in UTF-8)? And it may be something different again in one of the hundreds of other encodings that are out there.

So, while you sometimes can determine that a line *isn't* UTF-8 (because not every byte sequence is valid UTF-8), you can never be sure a byte sequence is UTF-8 without additional information.
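
To make the ambiguity concrete, here is a minimal sketch in Python. It is not from the original question, and the helper name could_be_utf8 is made up for illustration. It decodes the bytes E2 A1 B9 both ways and shows that a strict UTF-8 decode can only ever rule UTF-8 out, never confirm it:

```python
import unicodedata


def could_be_utf8(data: bytes) -> bool:
    """Hypothetical helper: False means 'definitely not UTF-8',
    True only means 'not ruled out'."""
    try:
        data.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return False
    return True


# The three bytes from the example above.
sample = bytes([0xE2, 0xA1, 0xB9])

# Both decodes succeed, so the bytes alone cannot tell us which was intended.
utf8_names = [unicodedata.name(ch) for ch in sample.decode("utf-8")]
latin1_names = [unicodedata.name(ch) for ch in sample.decode("latin-1")]

print(utf8_names)    # one character
print(latin1_names)  # three characters

# A sequence that cannot be UTF-8: the lead byte E2 must be followed by
# two continuation bytes (80..BF), and 28 is not one.
print(could_be_utf8(sample))           # not ruled out
print(could_be_utf8(b"\xe2\x28\xa1"))  # ruled out
```

A True result here is only "could be": every byte sequence decodes successfully as Latin-1, so the same test tells you nothing for that encoding.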

Re^2: Check UTF8 by Anonymous Monk on Apr 26, 2011 at 22:00 UTC
True. So tell me: why on earth does the Unicode standard recommend against putting a BOM at the start of a UTF-8 file? Those guys must really like ambiguous data and the quandary it creates for software developers.