in reply to Re^4: [RFC] How to reverse a (text) string
in thread [RFC] How to reverse a (text) string

I understand your sentiment, but in fairness to the folks creating Unicode, the design process involves a lot of tough calls... The number of languages that use "diacritic" combinations on basic characters is somewhat astonishing, and for the ones that had any sort of pre-existing standard for character encoding, there's the initial problem of the "inertia" from established practice (e.g. collating logic).

The combinatorial problem might seem relatively trivial for this or that language taken on its own, but it can get pretty cumbersome when each of the many syllable-based scripts requires a few thousand combined forms, built from fewer than a hundred basic components. And there are actually quite a few text-processing applications where it really helps to have the graphemes expressed in terms of their individual components, because each component tends to have a stable linguistic function or "meaning" in the structure of the language.
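To make the "individual components" idea concrete, here is a minimal sketch that uses canonical decomposition (NFD) from Unicode::Normalize to list the pieces of a precomposed character. The helper name `components` is just something I made up for illustration, and canonical decomposition is only one way of getting at components; it won't split everything a linguist would want split.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFD);

# Hypothetical helper: return the code points of a string's canonical
# decomposition (NFD) as a list of "U+XXXX" labels.
sub components {
    my ($s) = @_;
    return map { sprintf 'U+%04X', ord($_) } split //, NFD($s);
}

# A precomposed Latin letter and a precomposed Hangul syllable:
my @e_acute = components("\x{E9}");    # U+0065 U+0301
my @han     = components("\x{D55C}");  # U+1112 U+1161 U+11AB
```

The Hangul syllable comes apart into its three jamo, which is the sort of component-level view that is hard to get if you only ever store the precomposed forms.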

(I'm thinking about how Korean is handled -- even when you put aside their use of Chinese ideographs, they still use a lot of code points. Applying that approach to Hebrew, Arabic, Hindi, Bengali, Tibetan, Tamil, and several others is, frankly, not an attractive prospect, IMHO.)

I guess the point is: things will have to be complicated one way or another. If you try to simplify in one area, you end up making things more complicated elsewhere, and vice versa. The existing approach of using combining marks has some nice advantages, and its disadvantages are made a bit less painful by the presence of the Unicode Character Database (this thing is included with Perl 5.8 distributions), which lets you look up any code point to see whether it's a "letter" or a "combining mark" (or a "number", or "punctuation" or "bracket" or...).
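As a rough sketch of what that lookup buys you for the original question: the regex properties `\p{M}`, `\p{L}`, `\p{N}` and `\p{P}` are backed by the Unicode Character Database, and `\X` matches a base character together with any combining marks that follow it. The sub names `classify` and `reverse_graphemes` are my own, not anything from the thread, and this ignores other complications such as bidirectional text.

```perl
use strict;
use warnings;

# Hypothetical helper: classify a single character by its Unicode
# general category, using the property tables that ship with Perl.
sub classify {
    my ($ch) = @_;
    return 'combining mark' if $ch =~ /\p{M}/;
    return 'letter'         if $ch =~ /\p{L}/;
    return 'number'         if $ch =~ /\p{N}/;
    return 'punctuation'    if $ch =~ /\p{P}/;   # includes brackets
    return 'other';
}

# Hypothetical helper: reverse a string one grapheme at a time, so a
# combining mark stays attached to the base character it follows.
sub reverse_graphemes {
    my ($s) = @_;
    return join '', reverse($s =~ /\X/g);
}

# "e" + COMBINING ACUTE ACCENT + "a" becomes "a" + "e" + accent,
# whereas a plain reverse() would leave the accent in front of the "e".
my $reversed = reverse_graphemes("e\x{0301}a");
```

The input is assumed to be a decoded character string, not raw UTF-8 bytes.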
