in reply to Re: Re: regex for utf-8
in thread regex for utf-8
UTF-8 is multibyte for code points >= 0x80, but every byte of those multibyte sequences has the high bit set! That's one of the nice features of UTF-8. Of course you meant to say [\x80-\xff]. You can also easily tell the number of bytes per character just by looking at the first byte. See this table from RFC 2279:

| UCS-4 range (hex) | UTF-8 octet sequence (binary) |
|---|---|
| 0000 0000-0000 007F | 0xxxxxxx |
| 0000 0080-0000 07FF | 110xxxxx 10xxxxxx |
| 0000 0800-0000 FFFF | 1110xxxx 10xxxxxx 10xxxxxx |
| 0001 0000-001F FFFF | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx |
| 0020 0000-03FF FFFF | 111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx |
| 0400 0000-7FFF FFFF | 1111110x 10xxxxxx ... 10xxxxxx |

If the high bit of the first byte is set, then the number of consecutive ones following it is the number of bytes that follow. All of those following bytes start with "10", so you can't confuse them with ASCII characters or with the leading byte of a UTF-8 sequence. Pretty easy!
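The lead-byte rule can be sketched in a few lines. This is only an illustration in Python, not code from the thread; the function name `utf8_sequence_length` is made up here, and it follows the RFC 2279 layout (sequences of up to 6 bytes).

```python
def utf8_sequence_length(lead):
    """Return the total number of bytes in the sequence that `lead` starts.

    Hypothetical helper based on the RFC 2279 table: it inspects only the
    first byte and does not validate the bytes that follow.
    """
    if not 0 <= lead <= 0xFF:
        raise ValueError("not a byte value")
    if lead < 0x80:
        return 1  # 0xxxxxxx: plain ASCII, a single byte

    # Count the consecutive 1 bits, starting at the high bit.
    ones = 0
    mask = 0x80
    while lead & mask:
        ones += 1
        mask >>= 1

    if ones == 1:
        raise ValueError("10xxxxxx is a continuation byte, not a lead byte")
    if ones > 6:
        raise ValueError("0xFE and 0xFF never appear in UTF-8")
    # `ones` counts the high bit too, so it equals 1 + (bytes that follow).
    return ones
```

For example, `utf8_sequence_length("é".encode("utf-8")[0])` inspects the byte 0xC3 (binary 11000011) and reports a two-byte sequence.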
Notice that the lead byte of a multibyte sequence will always be in the range [\xc0-\xfd].
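Putting the ranges together, a byte-level regex can split a buffer into characters. The sketch below uses Python's `re` on `bytes` rather than Perl; the names `UTF8_CHAR` and `split_chars` are invented for the example. The continuation range [\x80-\xbf] is simply the "10xxxxxx" pattern written as a byte range. It is a structural match only, under the assumption that the input is already well-formed: it does not check that the number of continuation bytes agrees with the lead byte.

```python
import re

# One "character" at the byte level: either a single ASCII byte, or a
# lead byte followed by one or more continuation bytes (10xxxxxx).
UTF8_CHAR = re.compile(rb"[\x00-\x7f]|[\xc0-\xfd][\x80-\xbf]+")


def split_chars(data):
    """Split UTF-8 encoded bytes into a list of per-character byte strings.

    Hypothetical helper: it assumes well-formed input and silently skips
    any stray byte that does not fit the pattern.
    """
    return UTF8_CHAR.findall(data)


def count_high_bytes(data):
    """Count bytes with the high bit set, i.e. everything in [\\x80-\\xff]."""
    return len(re.findall(rb"[\x80-\xff]", data))
```

With `data = "héllo €".encode("utf-8")`, `split_chars(data)` keeps the two bytes of "é" and the three bytes of "€" together, while `count_high_bytes(data)` sees all five of those bytes individually.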