Very close. The "secret decoder bytes" (BOM, or byte order mark) is a Unicode-specific thing. It doesn't apply to files encoded in ASCII or ISO-Latin-1. This is one thing which limits its usefulness. I think it's also only commonly used with UCS-2 on Windows.
Notepad's default behaviour is to try to open the file as "Unicode". What this means is that it looks for the BOM at the beginning. If it finds one, it can determine the encoding used and will then also hide the BOM from you (i.e. if it recognises the BOM, it doesn't display it). If the file doesn't have a BOM, Notepad calls the Win32 "guess encoding" routine we've already mentioned.
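To make the "look for a BOM, hide it, otherwise fall back" logic concrete, here is a rough sketch in Python (not Perl, and certainly not what Notepad actually runs). The names sniff_bom and decode_with_bom are made up for illustration, and the latin-1 fallback is just a stand-in assumption for the real guessing routine.

```python
import codecs

# Known BOMs. The UTF-32 ones come first because the UTF-32 LE BOM
# begins with the same two bytes as the UTF-16 LE BOM.
BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]

def sniff_bom(data):
    """Return (encoding, data_without_bom), or (None, data) if no BOM is found."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name, data[len(bom):]
    return None, data

def decode_with_bom(data, fallback="latin-1"):
    """Decode bytes using the BOM if present, hiding the BOM from the result.

    The fallback encoding is an assumption standing in for a real
    "guess the encoding" step.
    """
    encoding, payload = sniff_bom(data)
    return payload.decode(encoding or fallback)
```

For example, decode_with_bom(b"\xff\xfeh\x00i\x00") gives the two-character string "hi", with the BOM consumed rather than shown.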
And you're absolutely right that other tools will treat the BOM as data. This is a problem with in-band signalling in general. If you dig out a hex editor (or write one in perl :-) you should be able to see the BOM at the beginning of a text file which you've saved as "Unicode" in Notepad. Be sure to binmode the filehandle if you're writing a hex dumper in perl - otherwise you'll get CRLF translation going on.
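If you'd rather see the idea before writing the perl version, a minimal hex dumper might look like the Python sketch below. The function names are my own, and opening the file in "rb" mode plays the role that binmode plays in perl: no newline translation, just raw bytes.

```python
def hex_dump(data, width=16):
    """Format bytes as offset, hex values, and printable ASCII, one row per line."""
    lines = []
    for offset in range(0, len(data), width):
        chunk = data[offset:offset + width]
        hex_part = " ".join(f"{b:02x}" for b in chunk)
        # Pad short final rows so the text column lines up.
        hex_part = hex_part.ljust(width * 3 - 1)
        text_part = "".join(chr(b) if 32 <= b < 127 else "." for b in chunk)
        lines.append(f"{offset:08x}  {hex_part}  {text_part}")
    return "\n".join(lines)

def dump_file(path):
    """Hex-dump a file, reading it in binary mode so no bytes are translated."""
    with open(path, "rb") as fh:
        return hex_dump(fh.read())
```

A file saved as "Unicode" in Notepad should show ff fe as its first two bytes in the output of dump_file.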
Also - to clarify, "Unicode" by itself isn't really an encoding (although you can be forgiven for thinking so from the terminology used in the Windows world). It's a list of characters, each of which is given a name and a number to identify it (e.g. Latin Capital Letter A with Macron). The numbers don't define an encoding on their own, since there isn't a single way to map a Unicode number to a sequence of bytes.
In the "Good Old Days" of single-byte encodings (e.g. iso-latin-1, ascii), a "character set" was both a list of characters and an encoding, because the number of the character was also its byte encoding. Unicode separates these two concepts. "Unicode" is the list of characters (each with an associated number), and the various encodings (UTF-8, UCS-2, UTF-16, etc.) specify how to convert that Unicode number to a sequence of bytes and back. (This latter step wasn't needed in the days of 0-255 character sets, since a character's identifying number just stood for itself in the byte stream.)
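A quick way to convince yourself of the difference is to take one character and look at its name, its number, and its bytes under a few encodings. This is only an illustrative Python sketch; the variable names are mine.

```python
import unicodedata

ch = "\u0100"  # the example character mentioned above

# The character's identity: a name and a number, with no bytes involved yet.
name = unicodedata.name(ch)   # 'LATIN CAPITAL LETTER A WITH MACRON'
code_point = ord(ch)          # 0x100

# The same number turns into different byte sequences per encoding.
encoded = {enc: ch.encode(enc) for enc in ("utf-8", "utf-16-le", "utf-16-be")}
# utf-8     -> b'\xc4\x80'
# utf-16-le -> b'\x00\x01'
# utf-16-be -> b'\x01\x00'

# In a single-byte character set, the number simply is the byte.
e_acute = "\u00e9"
latin1_byte = e_acute.encode("latin-1")   # b'\xe9', and ord(e_acute) == 0xE9
```

Note that ch has no latin-1 encoding at all: its number is above 255, so ch.encode("latin-1") raises UnicodeEncodeError.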
On Windows, "Unicode" generally means "UCS-2 encoded Unicode". (And to be precise, it really refers to "the subset of Unicode which can be represented in UCS-2".)
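To illustrate what "the subset which can be represented in UCS-2" means: UCS-2 uses one 16-bit unit per character, so it only reaches numbers up to 0xFFFF. A small Python sketch, with a hypothetical helper name of my own:

```python
def fits_in_ucs2(text):
    """True if every character's number fits in a single 16-bit unit."""
    return all(ord(ch) <= 0xFFFF for ch in text)

bmp_char = "\u0100"         # fits: one 16-bit unit is enough
astral_char = "\U0001F600"  # number above 0xFFFF, out of reach for UCS-2

# UTF-16 copes with the second case by using two 16-bit units
# (a surrogate pair), i.e. four bytes instead of two.
bmp_len = len(bmp_char.encode("utf-16-le"))
astral_len = len(astral_char.encode("utf-16-le"))
```

Here fits_in_ucs2(bmp_char) is true and fits_in_ucs2(astral_char) is false, and the two encoded lengths differ accordingly.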
jbert's rule: any charset/encoding handling or timezone code written by yourself or your co-workers is buggy. Always pull in an external library.