in reply to Re^3: [RFC] How to reverse a (text) string
in thread [RFC] How to reverse a (text) string

I don't really care how many bytes are needed to STORE a codepoint, but if a codepoint doesn't equal a character (sorry, grapheme) then there is really no point in using that concept. In UTF-8 (and, I believe, in most other ways to encode Unicode) some codepoints take more bytes than others. Some bytes (or byte sequences) are reserved to mean something like "waitasecond ... we are not done yet, read N more bytes". Which means I can't assume a byte is a character; I have to understand those reserved codes and take enough bytes to have the whole "character". With codepoint != character (OK, grapheme, that sounds cooler) this one level of "you have to understand the data to be able to even just say how long it is" is not enough, 'cause there's yet another level on top of that.
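
To make the two layers concrete, here's a quick sketch in Perl (Encode and the \X regex escape ship with perl 5.8; take it as an illustration, not a recipe):

    use strict;
    use warnings;
    use Encode qw(encode);

    # "e" followed by U+0301 COMBINING ACUTE ACCENT:
    # one grapheme, two codepoints, three bytes in UTF-8
    my $str = "e\x{301}";

    print length($str), " codepoints\n";               # 2 -- length() counts codepoints
    my @graphemes = $str =~ /\X/g;                     # \X matches a whole grapheme cluster
    print scalar(@graphemes), " grapheme(s)\n";        # 1
    print length(encode('UTF-8', $str)), " bytes\n";   # 3 -- yet another count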

Sorry, that sounds rather ... erm ... interesting.

Separating a grapheme into a base character and a decoration makes some sense in DTP. If your font doesn't have that character, you may construct it out of pieces it does have. (TeX used to do that for Czech accented characters at first.) For anything else it's overcomplication.
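
And once both forms are allowed, the same "character" can arrive as two different codepoint sequences, so you have to normalize before you can even compare strings. A quick illustration (Unicode::Normalize is bundled with perl 5.8; just a sketch):

    use Unicode::Normalize qw(NFC NFD);

    my $composed   = "\x{E9}";        # U+00E9 LATIN SMALL LETTER E WITH ACUTE
    my $decomposed = NFD($composed);  # "e" . U+0301 COMBINING ACUTE ACCENT

    print length($composed),   "\n";                         # 1
    print length($decomposed), "\n";                          # 2
    print $composed eq $decomposed ? "eq\n" : "ne\n";         # ne -- same grapheme, different codepoints
    print NFC($decomposed) eq $composed ? "eq\n" : "ne\n";    # eq -- only after recomposing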

I would rather have the list of Unicode codepoints grow (which I believe it does anyway) and KNOW where a character starts and ends in the stream of bytes.

Jenda
P.S.: If you overcomplicate things, you lose correctness as well, because no one will bother to implement all that stuff correctly.
Support Denmark!
Defend the free world!


Re^5: [RFC] How to reverse a (text) string
by graff (Chancellor) on Dec 20, 2007 at 04:10 UTC
    I understand your sentiment, but in fairness to the folks creating Unicode, the design process involves a lot of tough calls... The number of languages that use "diacritic" combinations on basic characters is somewhat astonishing, and for the ones that had any sort of pre-existing standard for character encoding, there's the initial problem of the "inertia" from established practice (e.g. collating logic).

    The combinatorial problem might seem relatively trivial for this or that language taken on its own, but it can get pretty cumbersome when each of the many syllable-based scripts requires a few thousand combined forms, built from fewer than a hundred basic components. And there are actually quite a few text-processing applications where it really helps to have the graphemes expressed in terms of their individual components, because each component tends to have a stable linguistic function or "meaning" in the structure of the language.

    (I'm thinking about how Korean is handled -- even when you put aside their use of Chinese ideographs, they still use a lot of code points. Applying that approach to Hebrew, Arabic, Hindi, Bengali, Tibetan, Tamil, and several others is, frankly, not an attractive prospect, IMHO.)
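
    Just to make that concrete (a quick sketch; Unicode::Normalize and charnames are bundled with perl 5.8, and Hangul decomposition is algorithmic), here's how one precomposed syllable maps back onto its jamo components:

        use Unicode::Normalize qw(NFD);
        use charnames ();

        # U+D55C HANGUL SYLLABLE HAN -- a single precomposed code point
        my $syllable = "\x{D55C}";

        # NFD splits it into its jamo (initial consonant, vowel, final consonant)
        for my $cp (map { ord } split //, NFD($syllable)) {
            printf "U+%04X %s\n", $cp, charnames::viacode($cp);
        }
        # U+1112 HANGUL CHOSEONG HIEUH
        # U+1161 HANGUL JUNGSEONG A
        # U+11AB HANGUL JONGSEONG NIEUN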

    I guess the point is: things will have to be complicated one way or another. If you try to simplify in one area, you end up making things more complicated elsewhere, and vice versa. The existing approach of using combining marks has some nice advantages, and its disadvantages are made a bit less painful by the presence of the Unicode Character Database (this is included with perl 5.8 distributions), which lets you look up any code point to see whether it's a "letter" or a "combining mark" (or a "number", or "punctuation", or "bracket", or...)
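
    For instance (just a sketch; Unicode::UCD is one of the modules bundled with perl 5.8), charinfo() will give you the general category of any code point:

        use Unicode::UCD qw(charinfo);

        for my $cp (0x0065, 0x0301, 0x05D0, 0x0660) {
            my $info = charinfo($cp);
            printf "U+%04X  %-25s  %s\n", $cp, $info->{name}, $info->{category};
        }
        # prints something like:
        # U+0065  LATIN SMALL LETTER E       Ll   (letter, lowercase)
        # U+0301  COMBINING ACUTE ACCENT     Mn   (mark, nonspacing -- a combining mark)
        # U+05D0  HEBREW LETTER ALEF         Lo   (letter, other)
        # U+0660  ARABIC-INDIC DIGIT ZERO    Nd   (number, decimal digit)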