in reply to Re: Re: Re: regex for utf-8
in thread regex for utf-8

I understood that Latin-1 is an 8-bit extension to ASCII, and that in UTF-8 any code points >= \x80 are represented in multiple bytes. Does "code points are identical" mean identical once the leading high bit is taken away? Please explain; I am understanding this, but slowly. John

Re: Re: Re: Re: Re: regex for utf-8
by John M. Dlugosz (Monsignor) on Feb 28, 2003 at 23:01 UTC
    OK.

    65 is a number. Whether you represent it as "65" the string (which is {0x36 0x35 0} as a C string literal), a two's-complement integer, big-endian or little-endian, or an 8-byte floating point number, or a BCD class, or whatever, it's still the number sixty-five.

    That is, the value is distinct from the representation.
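To make the value/representation split concrete, here is a small Python sketch (not from the thread, just an illustration). It stores the same number 65 in four of the representations mentioned above and reads each one back. The variable names are made up for the example.

```python
import struct

value = 65

# Four different byte-level representations of the same number.
as_text = str(value).encode("ascii")           # the digits '6' and '5': 0x36 0x35
as_big_endian = struct.pack(">i", value)       # 4-byte two's-complement, big-endian
as_little_endian = struct.pack("<i", value)    # same integer, little-endian
as_double = struct.pack(">d", float(value))    # 8-byte IEEE 754 floating point

# Reading each representation back yields the same value every time.
decoded = [
    int(as_text.decode("ascii")),
    struct.unpack(">i", as_big_endian)[0],
    struct.unpack("<i", as_little_endian)[0],
    int(struct.unpack(">d", as_double)[0]),
]
print(decoded)
```

The bytes differ in every case; only the rule for reading them changes.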

    UTF-8 is a representation. It explains how to take integers in the range 0..2^31-1 and encode them in a stream of bytes for interchange.

    The "code point" is 65 for capital A. Whether you want to store that as a byte, a nul-terminated string of 16-bit characters, a floating point value, or whatever, it still means the letter A to anyone who can read the representation.

    There is no lookup table in that code, as you noted, because it is changing representations only, not mapping any code points.

    UTF-8 is a variable-length coding system. For values that fit into 7 bits, leave the high bit 0 and emit one byte. So, all legal ASCII is also legal UTF-8! That's a designed-in feature.

    For values that fit in 11 bits, emit 110xxxxx 10yyyyyy as two bytes, where the original number in binary is xxxxxyyyyyy (five x bits followed by six y bits). A subset of that gives you 110000xx for the first byte if the value fits in 8 bits. That's what your code is looking for.
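The one-byte and two-byte rules above can be sketched in Python as follows. This is a minimal illustration, not the code discussed earlier in the thread; the function name `encode_utf8_short` is hypothetical, and it deliberately stops at 11 bits because longer sequences are not covered in this reply.

```python
def encode_utf8_short(code_point: int) -> bytes:
    """Encode a code point below 0x800 using the 1- and 2-byte UTF-8 forms."""
    if code_point < 0:
        raise ValueError("code points are non-negative")
    if code_point < 0x80:
        # Fits in 7 bits: high bit stays 0, one byte, identical to ASCII.
        return bytes([code_point])
    if code_point < 0x800:
        # Fits in 11 bits: 110xxxxx 10yyyyyy
        first = 0b11000000 | (code_point >> 6)           # top five bits
        second = 0b10000000 | (code_point & 0b00111111)  # low six bits
        return bytes([first, second])
    raise ValueError("needs three or more bytes; outside this sketch")


# Latin-1 e-acute is code point 0xE9, which fits in 8 bits,
# so the first byte has the form 110000xx.
print(encode_utf8_short(0xE9).hex())
```

For any value in 0x80..0xFF the first byte can only be 0xC2 or 0xC3, which is the 110000xx pattern mentioned above.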

    —John

    P.S. how about joining the forum?

      My logging in crossed paths with my posting. Thanks for noting my unlogged status. And thanks for the answer. One other question: apparently Notepad adds a prefix character when converting text to UTF-8. I had never heard of that before; is this the famous BOM?
        Must be a new Notepad. The Notepad I know from Windows NT only handles the current ANSI code page or "Unicode", which is saved as UTF-16LE with a BOM and CRLFs for line endings.

        UTF-16 is what Windows functions that take "Unicode" expect. Well, almost... UCS-2 is 16 bits per code point, period. Full UTF-16 uses a group of 2048 special code points in pairs to represent values over 64K.
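As an illustration of the "special code points in pairs" idea, here is a short Python sketch of how UTF-16 splits a value above 64K into two 16-bit units. The 2048 special code points are the range D800..DFFF. The function name `to_surrogate_pair` is made up for this example.

```python
def to_surrogate_pair(code_point: int) -> tuple[int, int]:
    """Split a code point above 0xFFFF into a UTF-16 high/low surrogate pair."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points above 0xFFFF need a surrogate pair")
    offset = code_point - 0x10000        # 20 bits remain
    high = 0xD800 | (offset >> 10)       # top 10 bits, in D800..DBFF
    low = 0xDC00 | (offset & 0x3FF)      # low 10 bits, in DC00..DFFF
    return high, low


high, low = to_surrogate_pair(0x1F600)
print(f"{high:04X} {low:04X}")  # prints "D83D DE00"
```

Software that treats the data as plain UCS-2 sees those two units as two separate 16-bit values rather than one character.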

        LE is "little endian". In my experience, Notepad doesn't work any other way.

        BOM is the "byte order mark", officially named "zero width no-break space", which is basically a no-op character. Its code is U+FEFF, and U+FFFE is guaranteed not to be a character. So read the first two bytes of the file, and you can tell whether it's little endian (FF FE) or big endian (FE FF).

        That character also has a particular encoding in UTF-8, if you care to figure it out. That can be used as a signature to identify UTF-8 files, too.
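For anyone who would rather not work it out by hand: U+FEFF encodes in UTF-8 as the three bytes EF BB BF. The sketch below shows one way to sniff the start of a file in Python using the BOM constants from the standard `codecs` module. The function name `sniff_bom` and the returned labels are assumptions for the example, and it ignores UTF-32, whose little-endian BOM also begins with FF FE.

```python
import codecs
from typing import Optional


def sniff_bom(data: bytes) -> Optional[str]:
    """Guess an encoding from a leading byte order mark, or None if absent."""
    if data.startswith(codecs.BOM_UTF8):       # EF BB BF
        return "utf-8"
    if data.startswith(codecs.BOM_UTF16_LE):   # FF FE
        return "utf-16-le"
    if data.startswith(codecs.BOM_UTF16_BE):   # FE FF
        return "utf-16-be"
    return None


print("\ufeff".encode("utf-8").hex())                 # prints "efbbbf"
print(sniff_bom("\ufeffhello".encode("utf-16-le")))   # prints "utf-16-le"
```

A missing BOM proves nothing either way, so a `None` result just means the file has to be identified by other means.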

        Check out Unipad. It can save and load any format or variety. Playing with it might be enlightening.

        Also check out the Unicode.org site.

        —John