in reply to Re^3: Unicode2ascii
in thread Unicode2ascii

Thanks ikegami and jbert. I don't get it yet, but will probably have to dig around for a tutorial on the web. It sounds like you're saying that some files have a "secret decoder byte" (or bytes) at the very beginning of the file that says what the file's encoding is (e.g., ascii, iso-latin-1, UTF-8, Unicode). Maybe the editor doesn't show these bytes (my guess is that they're something between 128 and 255 -- something an editor wouldn't draw on the screen anyway). But then they would still be treated as real data by various command line utils... Hmm...

Re^5: Unicode2ascii
by jbert (Priest) on Nov 28, 2006 at 18:14 UTC
    Very close. The "secret decoder bytes" (the BOM, or byte order mark) are a Unicode-specific thing. They don't apply to files encoded in ASCII or ISO-Latin-1, which is one thing that limits their usefulness. I think the BOM is also only commonly used with UCS-2 on Windows.

    Notepad's default behaviour is to try to open the file as "unicode". What this means is that it looks for a BOM at the beginning of the file. If it finds one, it can determine the encoding used, and it will also hide the BOM from you (i.e. if it recognises the BOM it doesn't display it). If the file doesn't have a BOM, Notepad falls back on the win32 "guess encoding" routine we've already mentioned.

    And you're absolutely right that other tools will treat the BOM as data. This is a problem with in-band signalling in general. If you dig out a hex editor (or write one in perl :-) you should be able to see the BOM at the beginning of a text file which you've saved as "Unicode" in notepad. Be sure to binmode the filehandle if you're writing a hex dumper in perl - otherwise you'll get CRLF translation going on.
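    For example, here's a minimal hex-dump sketch in Perl along those lines (the filename "unicode.txt" is just a placeholder -- point it at any file you've saved as "Unicode" from Notepad):

        #!/usr/bin/perl
        use strict;
        use warnings;

        # Dump the first 16 bytes of a file in hex.
        # "unicode.txt" is a placeholder filename.
        open my $fh, '<', 'unicode.txt' or die "open: $!";
        binmode $fh;    # raw bytes -- no CRLF translation
        read $fh, my $buf, 16;
        printf '%02x ', ord for split //, $buf;
        print "\n";

    A file saved as "Unicode" in Notepad (UCS-2/UTF-16, little-endian) starts with the bytes ff fe; a UTF-8 file with a BOM starts with ef bb bf.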

    Also, to clarify: "Unicode" by itself isn't really an encoding (although you can be forgiven for thinking so from the terminology used in the Windows world). It's a list of characters, which are given names and numbers to identify them (e.g. "Latin Capital Letter A with Macron" is number 0x100). The numbers don't define an encoding on their own, since there isn't a single way to map a Unicode number to a sequence of bytes.

    In the "Good Old Days" of single-byte encodings (e.g. iso-latin-1, ascii), a "character set" was both a list of characters and an encoding, because the number of the character was also it's byte encoding. Unicode seperates these two concepts..."Unicode" is the list of characters (with an associated number) and the various encodings (UTF8, UCS-2, UTF16, etc) specify how to convert that Unicode number to a sequence of bytes and back. (This latter step wasn't needed in the days of 0->255 character sets, since a character's identifying number just stood for itself in the byte stream).

    On Windows "Unicode" generally means "UCS-2 encoded Unicode". (And to be precise it really refers to "The subset of unicode which can be represented in UCS-2").

    jbert's rule: any charset/encoding handling or timezone code written by yourself or your co-workers is buggy. Always pull in an external library.

      Ok. I just read the Joel article linked from the perlunitut page that shmem pointed to. So things are a little clearer now. :)

      THe "secret decoder bytes" (BOM) is a unicode-specific thing. {snip} I think it's also only commonly used when using UCS-2 on Windows.

      Ah. Ok. Incidentally, I'm not running MS Windows, but rather am using Emacs on GNU/Linux, occasionally making use of ncurses-hexedit. Emacs, running as a GUI under X, happens to have a little area where you can hover the mouse and it tells you encoding information, but I've only ever seen it tell me ascii or iso-latin-1.

      Also, to clarify: "Unicode" by itself isn't really an encoding {snip}

      Ah. Now things are clearer. I see now that Unicode is simply a character set, where each character has a number associated with it (a so-called "code point"). And, as you point out, there are any number of ways you can encode it.

      "Unicode" is the list of characters (with an associated number) and the various encodings (UTF8, UCS-2, UTF16, etc) specify how to convert that Unicode number to a sequence of bytes and back.

      Very good.

      Most interesting to me is that UTF-8 is a *Unicode* encoding. Now things make a bit more sense. :)

      I typically use GNU/Linux systems, and will look into what's involved with properly setting them up to use UTF-8. Thanks again!
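      In the meantime, here's a minimal sketch of reading a UTF-8 file from Perl explicitly, independent of the system locale (the filename "notes.txt" is just a placeholder):

          #!/usr/bin/perl
          use strict;
          use warnings;

          # Decode UTF-8 on read and encode it again on output, so
          # length() counts characters rather than bytes.
          # "notes.txt" is a placeholder filename.
          binmode STDOUT, ':encoding(UTF-8)';
          open my $in, '<:encoding(UTF-8)', 'notes.txt' or die "open: $!";
          while (my $line = <$in>) {
              chomp $line;
              printf "%d chars: %s\n", length($line), $line;
          }
          close $in;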