I don't really care how many bytes are needed to STORE a codepoint, but if a codepoint doesn't equal a character (sorry, grapheme) then there is really no point in using that concept. In UTF-8 (and I believe in most other ways to encode Unicode) some codepoints take more bytes than others. Some byte values are reserved to mean something like "wait a second ... we are not done yet, read N more bytes". Which means I can't assume a byte is a character; I have to understand those reserved codes and take enough bytes to have the whole "character". With Codepoint != Character (OK, grapheme, that sounds cooler), this one level of "you have to understand the data to be able to even just say how long it is" is not enough, 'cause there's yet another level on top of that.
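To make the two levels concrete, here is a rough Python sketch. The function names (`utf8_sequence_length`, `split_codepoints`, `simple_clusters`) are made up for illustration, and the cluster logic is a deliberate simplification: it only glues combining marks onto the preceding base character and is not the full Unicode grapheme cluster algorithm.

```python
import unicodedata

def utf8_sequence_length(lead_byte):
    """Level 1: the first byte of a UTF-8 sequence says how many bytes follow."""
    if lead_byte < 0x80:
        return 1                      # 0xxxxxxx: plain ASCII
    if lead_byte >> 5 == 0b110:
        return 2                      # 110xxxxx: one more byte to read
    if lead_byte >> 4 == 0b1110:
        return 3                      # 1110xxxx: two more bytes to read
    if lead_byte >> 3 == 0b11110:
        return 4                      # 11110xxx: three more bytes to read
    raise ValueError("continuation byte or invalid lead byte")

def split_codepoints(data):
    """Cut a valid UTF-8 byte string into one-codepoint strings."""
    out = []
    i = 0
    while i < len(data):
        n = utf8_sequence_length(data[i])
        out.append(data[i:i + n].decode("utf-8"))
        i += n
    return out

def simple_clusters(text):
    """Level 2 (simplified): attach combining marks to the preceding codepoint."""
    clusters = []
    for ch in text:
        if clusters and unicodedata.combining(ch):
            clusters[-1] += ch        # a decoration belongs to the previous base
        else:
            clusters.append(ch)
    return clusters

text = "re\u0301sume\u0301"           # "résumé" written with combining acute accents
data = text.encode("utf-8")

byte_count = len(data)
codepoint_count = len(split_codepoints(data))
cluster_count = len(simple_clusters(text))
```

Three different answers to "how long is it?" for the same word: one count for bytes, one for codepoints, one for what a reader would call characters.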
Sorry, that sounds rather ... erm ... interesting.
Separating a grapheme into a base character and a decoration makes some sense in DTP. In case your font doesn't have that character you may construct it out of something it does have. (TeX used to do that for Czech accented characters at first.) For anything else it's an overcomplication.
I would rather have the list of Unicode codepoints grow (which I believe it does anyway) and KNOW where a character starts and ends in the stream of bytes.
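For what it's worth, the base-plus-decoration split can be seen with Python's standard `unicodedata` module. This is only a small sketch using a Czech word as sample data; the variable names are mine. It also shows what a naive codepoint-by-codepoint reversal does to the decomposed form.

```python
import unicodedata

word = "\u010daj"                                   # "čaj", with a precomposed č
decomposed = unicodedata.normalize("NFD", word)     # c + combining caron + a + j
recomposed = unicodedata.normalize("NFC", decomposed)

# Reversing codepoint by codepoint moves the caron onto the wrong letter:
# the result is j, a, caron, c, which renders as "jǎc" instead of "jač".
naive_reversed = decomposed[::-1]
```

With the precomposed form the naive reversal happens to work, which is the argument for having a single codepoint per character in the first place.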
Jenda
P.S.: If you overcomplicate, you lose correctness as well, because no one will bother to implement all that stuff correctly.
Support Denmark! Defend the free world!
In reply to Re^4: [RFC] How to reverse a (text) string
by Jenda
in thread [RFC] How to reverse a (text) string
by moritz