in reply to Re^2: Unicode2ascii
in thread Unicode2ascii

In general, it can't. But you may have system-wide defaults/policies/hints.

On Windows, it guesses, and sometimes it guesses wrongly. This is the origin of the Notepad bug stories which come up from time to time. (There is a Windows API function which looks at the byte stream and tries to guess the encoding. Notepad calls this function, but it isn't reliable on short, even-length strings of ASCII.)
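
As a quick Perl sketch of why that guess is hard (this is not Notepad's actual code, and "hi there" is just an arbitrary sample string): any even-length run of ASCII bytes is also a perfectly valid UTF-16LE string, it just decodes to different characters.

    use strict;
    use warnings;
    use Encode qw(decode);

    my $bytes    = "hi there";                 # 8 plain ASCII bytes, no BOM
    my $as_utf16 = decode('UTF-16LE', $bytes);

    # Each pair of ASCII bytes becomes one 16-bit code point, most of
    # which land in the CJK range - nothing flags the result as wrong.
    printf "U+%04X\n", ord($_) for split //, $as_utf16;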

It's also a bit more complex than that, because you can write a Byte Order Mark (a two-byte sequence, for UCS-2/UTF-16) at the beginning of the text stream, which indicates that the following characters are in a certain encoding. But this is in-band signalling, which kind of sucks, because it only really works if you already know the file is Unicode.
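
If you want to see those marker bytes for yourself, here's a small Perl sketch using the core Encode module ("Aloha" is just sample text): Perl's 'UTF-16' encoding prepends a big-endian BOM, and a UTF-8 BOM is simply U+FEFF encoded as three bytes.

    use strict;
    use warnings;
    use Encode qw(encode);

    my $text = "Aloha";

    # UTF-16 with a BOM: starts FE FF (a little-endian file would start FF FE)
    print unpack("H*", encode('UTF-16', $text)), "\n";               # feff...

    # A UTF-8 file can carry a BOM too: U+FEFF encodes as EF BB BF
    print unpack("H*", encode('UTF-8', "\x{FEFF}" . $text)), "\n";   # efbbbf...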

This area is UTF8's strength. Since ASCII is a strict subset of UTF8, you can treat a stream of bytes as UTF8 and everything will be fine if the stream is actually ASCII. As long as the stream is one of those two, you're OK.
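
A tiny Perl check of that claim (the string itself is arbitrary): encoding pure ASCII text as UTF-8 leaves the bytes completely unchanged.

    use strict;
    use warnings;
    use Encode qw(encode);

    my $ascii = "plain old ASCII text";
    my $utf8  = encode('UTF-8', $ascii);

    # Same bytes either way, which is why treating ASCII files as UTF8 is safe
    print $utf8 eq $ascii ? "identical bytes\n" : "different bytes\n";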

So there are two main camps:

Windows: We're slowly moving to two-byte UCS-2 everywhere. People need to guess which encoding is in use.

Unix: We're moving from ASCII to UTF8. If your app treats text files as containing UTF8 it'll work happily with ASCII or UTF8 files.

Replies are listed 'Best First'.
Re^4: Unicode2ascii
by j3 (Friar) on Nov 28, 2006 at 17:46 UTC
    Thanks ikegami and jbert. I don't get it, but will probably have to dig around for a tutorial on the web. Thanks. It sounds like you're saying that some files have a "secret decoder byte" (or bytes) at the very beginning of the file that say what the file's encoding is (i.e., ascii, iso-latin-1, UTF-8, Unicode). Maybe the editor doesn't show this byte (my guess is that it's something between 128 and 255 -- something an editor wouldn't draw on the screen anyway). But then, it would still be considered real data to various command line utils... Hmm...
      Very close. The "secret decoder bytes" (BOM) is a Unicode-specific thing. It doesn't apply to files encoded in ASCII or ISO-Latin-1, which is one thing that limits its usefulness. I think it's also only commonly used when using UCS-2 on Windows.

      Notepad's default behaviour is to try to open the file as "Unicode". What this means is that it looks for a BOM at the beginning. If it finds one, it can determine the encoding used and will then also hide the BOM from you (i.e. if it recognises the BOM, it doesn't display it). If the file doesn't have a BOM, Notepad calls the Win32 "guess the encoding" routine we've already mentioned.
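
      Here's a rough Perl sketch of that first step (my own version, not Notepad's real logic; sniff_bom is just a name I made up): peek at the first few raw bytes of a file and see whether they match a known BOM.

          use strict;
          use warnings;

          sub sniff_bom {
              my ($path) = @_;
              open my $fh, '<:raw', $path or die "open $path: $!";
              my $head = '';
              read $fh, $head, 3;

              return 'UTF-8 (with BOM)'   if $head =~ /^\xEF\xBB\xBF/;
              return 'UTF-16LE / UCS-2LE' if $head =~ /^\xFF\xFE/;
              return 'UTF-16BE / UCS-2BE' if $head =~ /^\xFE\xFF/;
              return 'no BOM - time to guess (ASCII? UTF-8? something else?)';
          }

          my $file = shift @ARGV or die "usage: $0 somefile\n";
          print sniff_bom($file), "\n";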

      And you're absolutely right that other tools will treat the BOM as data. This is a problem with in-band signalling in general. If you dig out a hex editor (or write one in Perl :-) you should be able to see the BOM at the beginning of a text file which you've saved as "Unicode" in Notepad. Be sure to binmode the filehandle if you're writing a hex dumper in Perl, otherwise you'll get CRLF translation going on.
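
      In that spirit, a minimal hex dumper sketch, with binmode so CRLF translation (and any encoding layer) stays out of the way:

          use strict;
          use warnings;

          my $path = shift @ARGV or die "usage: $0 somefile\n";
          open my $fh, '<', $path or die "open $path: $!";
          binmode $fh;    # raw bytes - no CRLF or encoding translation

          while (read $fh, my $chunk, 16) {
              my $hex = join ' ', map { sprintf '%02x', $_ } unpack 'C*', $chunk;
              (my $printable = $chunk) =~ s/[^\x20-\x7e]/./g;
              printf "%-47s  %s\n", $hex, $printable;
          }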

      Also - to clarify, "Unicode" by itself isn't really an encoding (although you can be forgiven for thinking so from the terminology used in the Windows world). It's a list of characters, which are given names and numbers to identify them (e.g. Latin Capital Letter A with Macron). The numbers don't define an encoding on their own, since there isn't a single way to map a Unicode number to a sequence of bytes.

      In the "Good Old Days" of single-byte encodings (e.g. iso-latin-1, ascii), a "character set" was both a list of characters and an encoding, because the number of the character was also it's byte encoding. Unicode seperates these two concepts..."Unicode" is the list of characters (with an associated number) and the various encodings (UTF8, UCS-2, UTF16, etc) specify how to convert that Unicode number to a sequence of bytes and back. (This latter step wasn't needed in the days of 0->255 character sets, since a character's identifying number just stood for itself in the byte stream).

      On Windows "Unicode" generally means "UCS-2 encoded Unicode". (And to be precise it really refers to "The subset of unicode which can be represented in UCS-2").

      jbert's rule: any charset/encoding handling or timezone code written by yourself or your co-workers is buggy. Always pull in an external library.
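
      In Perl that external library is the core Encode module. A sketch of the usual pattern ("café" in Latin-1 is just sample data): decode incoming bytes into characters, work in characters, and encode on the way back out.

          use strict;
          use warnings;
          use Encode qw(decode encode);

          my $latin1_bytes = "caf\xE9";                            # "café" in ISO-Latin-1
          my $chars        = decode('ISO-8859-1', $latin1_bytes);  # bytes -> characters
          my $utf8_bytes   = encode('UTF-8', $chars);              # characters -> bytes

          print unpack('H*', $utf8_bytes), "\n";                   # 636166c3a9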

        Ok. I just read the Joel article linked to from the perlunitut page that shmem linked to. So things are a little clearer now. :)

        THe "secret decoder bytes" (BOM) is a unicode-specific thing. {snip} I think it's also only commonly used when using UCS-2 on Windows.

        Ah. Ok. Incidentally, I'm not running MS Windows, but rather am using Emacs on GNU/Linux, occasionally making use of ncurses-hexedit. Emacs, running as a GUI under X, happens to have a little area where you can hover the mouse and it tells you encoding information, but I've only ever seen it tell me ascii or iso-latin-1.

        Also - to clarify, "Unicode" by itself isn't really an encoding {snip}

        Ah. Now things are clearer. I see now that Unicode is simply a character set (where each character has a number associated with it (a so-called "code point")). And, as you point out, there's any number of ways you can encode it.

        "Unicode" is the list of characters (with an associated number) and the various encodings (UTF8, UCS-2, UTF16, etc) specify how to convert that Unicode number to a sequence of bytes and back.

        Very good.

        Most interesting to me is that UTF-8 is a *Unicode* encoding. Now things make a bit more sense. :)

        I typically use GNU/Linux systems, and will look into what's involved with properly setting them up to use UTF-8. Thanks again!
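
        For the Perl side of that, a small sketch (notes.txt is a made-up filename): with a UTF-8 locale such as LANG=en_US.UTF-8 in place, ask Perl for UTF-8 on the standard handles and on the files you open, and the rest of the program deals in characters rather than bytes.

            use strict;
            use warnings;
            use open ":std", ":encoding(UTF-8)";   # UTF-8 default layer, incl. STDIN/STDOUT/STDERR

            # explicit layer shown for clarity; the pragma above already covers it
            open my $fh, '<:encoding(UTF-8)', 'notes.txt' or die "open: $!";
            while (my $line = <$fh>) {
                print length($line), " characters: $line";
            }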