in reply to Re: Unicode2ascii
in thread Unicode2ascii

jbert, when you open a file in your editor, how does the editor know whether two bytes next to each other represent 2 separate characters or one "utf8" character?

For that matter, if you open a file that contains unicode encoded characters, how can it tell? If it's just a file full of bytes, wouldn't your editor just try and display each byte as its ascii representation?

Re^3: Unicode2ascii
by ikegami (Patriarch) on Nov 28, 2006 at 15:36 UTC

    jbert, when you open a file in your editor, how does the editor know whether two bytes next to each other represent 2 separate characters or one "utf8" character?

    Simple Answer: It doesn't. Depending on the editor, it either needs to be told, requires a specific format, or requires the file to be in the encoding used by the system.

    Complex Answer: Editors can tell the difference between the different unicode encodings (but not non-unicode encodings) *if* the file starts with a Byte Order Mark. File::BOM can help you in that case.
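
    For example, here's a minimal sketch using open_bom from File::BOM (the file name is made up). As described in its docs, it peeks at any leading BOM, pushes the matching encoding layer onto the handle, and falls back to a default layer when no BOM is present:

        use File::BOM qw(open_bom);

        # Detect a BOM (if any), set the right encoding layer on $fh,
        # and return the encoding found ('' if the file had no BOM).
        my $enc = open_bom(my $fh, 'mystery.txt', ':encoding(ISO-8859-1)');
        print $enc ? "BOM says: $enc\n" : "No BOM, assumed latin-1\n";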

    Update: Added to the simple answer.

Re^3: Unicode2ascii
by jbert (Priest) on Nov 28, 2006 at 15:57 UTC
    In general, it can't. But you may have system-wide defaults/policies/hints.

    On windows, it guesses, sometimes wrongly. This is the origin of the notepad bug stories which come up from time to time. (There is a Windows API function, IsTextUnicode, which looks at the byte stream and tries to guess. Notepad calls this function, but it isn't reliable on short, even-length strings of ASCII).

    It's also a bit more complex than that, because you can write a Byte Order Mark at the beginning of the text stream (a short byte sequence: FF FE or FE FF for UTF-16, EF BB BF for UTF-8) which indicates the encoding of the characters that follow. But this is in-band signalling, which kind of sucks, because it only really works if you already know the file is Unicode.

    This area is UTF8's strength. Since ASCII is a strict subset of UTF8, you can treat a stream of bytes as UTF8 and everything will be fine if the stream is actually ASCII. As long as the stream is one of those two, you're OK.
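
    A quick way to convince yourself of that, using the core Encode module: decoding pure-ASCII bytes as strict UTF-8 succeeds unchanged (FB_CROAK makes decode die on anything that isn't valid UTF-8):

        use Encode qw(decode);

        my $bytes = "plain old ASCII text";    # every byte < 128
        # Encode::FB_CROAK: die if the bytes aren't valid UTF-8
        my $chars = decode('UTF-8', $bytes, Encode::FB_CROAK);
        print "still fine: $chars\n";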

    So there are two main camps:

    Windows: We're slowly moving to two-bytes-everywhere UCS-2. People need to guess which encoding is in use.

    Unix: We're moving from ASCII to UTF8. If your app treats text files as containing UTF8 it'll work happily with ASCII or UTF8 files.
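
    In Perl terms the Unix approach is just an encoding layer on the filehandle (the file name here is made up); ASCII files read through it unchanged:

        # Read a text file as UTF-8; works identically for pure-ASCII files.
        open my $fh, '<:encoding(UTF-8)', 'notes.txt' or die "open: $!";
        while (my $line = <$fh>) {
            print length($line), " characters\n";   # character count, not byte count
        }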

      Thanks ikegami and jbert. I don't get it, but will probably have to dig around for a tutorial on the web. Thanks. It sounds like you're saying that some files have a "secret decoder byte" (or bytes) at the very beginning of the file that say what the file's encoding is (i.e., ascii, iso-latin-1, UTF-8, Unicode). Maybe the editor doesn't show this byte (my guess is that it's something between 128 and 255 -- something an editor wouldn't draw on the screen anyway). But then, it would still be considered real data to various command line utils... Hmm...
        Very close. The "secret decoder bytes" (the BOM) are a unicode-specific thing. It doesn't apply to files encoded in ASCII or ISO-Latin-1, which is one thing that limits its usefulness. I think it's also only commonly used when using UCS-2 on Windows.

        Notepad's default behaviour is to try and open the file as "unicode". What this means is that it looks for the BOM at the beginning. If it finds one, it can determine the encoding used, and it then also hides the BOM from you rather than displaying it. If the file doesn't have a BOM, notepad calls the win32 "guess encoding" routine we've already mentioned.

        And you're absolutely right that other tools will treat the BOM as data. This is a problem with in-band signalling in general. If you dig out a hex editor (or write one in perl :-) you should be able to see the BOM at the beginning of a text file which you've saved as "Unicode" in notepad. Be sure to binmode the filehandle if you're writing a hex dumper in perl - otherwise you'll get CRLF translation going on.
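
        Something like this quick sketch will do (the file name is made up). A file saved as "Unicode" in notepad should start with ff fe, the little-endian BOM:

            # Dump the first 16 bytes of a file in hex - enough to spot a BOM.
            open my $fh, '<', 'saved_by_notepad.txt' or die "open: $!";
            binmode $fh;    # raw bytes, no CRLF translation
            read $fh, my $buf, 16;
            print join(' ', map { sprintf '%02x', ord } split //, $buf), "\n";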

        Also - to clarify, "Unicode" by itself isn't really an encoding (although you can be forgiven for thinking so from the terminology used in the Windows world). It's a list of characters, which are given names and numbers to identify them. (e.g. Latin Capital Letter A with Macron) The numbers don't define an encoding on their own, since there isn't a single way to map the unicode number to a sequence of bytes.
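
        You can poke at that list of names and numbers from Perl with the core charnames module:

            use charnames ();

            # Look up the official Unicode name for character number 0x0100.
            print charnames::viacode(0x0100), "\n";
            # prints: LATIN CAPITAL LETTER A WITH MACRON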

        In the "Good Old Days" of single-byte encodings (e.g. iso-latin-1, ascii), a "character set" was both a list of characters and an encoding, because the number of the character was also it's byte encoding. Unicode seperates these two concepts..."Unicode" is the list of characters (with an associated number) and the various encodings (UTF8, UCS-2, UTF16, etc) specify how to convert that Unicode number to a sequence of bytes and back. (This latter step wasn't needed in the days of 0->255 character sets, since a character's identifying number just stood for itself in the byte stream).

        On Windows "Unicode" generally means "UCS-2 encoded Unicode". (And to be precise it really refers to "The subset of unicode which can be represented in UCS-2").

        jbert's rule: any charset/encoding handling or timezone code written by yourself or your co-workers is buggy. Always pull in an external library.