in reply to Re: [RFC] How to reverse a (text) string
in thread [RFC] How to reverse a (text) string

Erm ... if the transformation in one direction is "normalization", then the transformation in the opposite direction is denormalization, isn't it? Or would that be too logical for Unicode? Isn't just one of the ways to encode the graphemes supposed to be the "normal form"?

In either case, thanks to you both for uncovering a (what should I call it?!?) interesting feature of Unicode. I had no idea characters in Unicode can be not only multibyte, but also multi-codepoint. I guess the committee that invented Unicode was too big.

Re^3: [RFC] How to reverse a (text) string
by graff (Chancellor) on Dec 20, 2007 at 03:16 UTC
    if the transformation in one direction is "normalization", then the transformation in the opposite direction is denormalization, isn't it?

    Whether you are "composing" or "decomposing" elements of complex characters, you have to make choices about how the operation is done: which elements will be combined into a composed character (as in the "e acute with dot below" example), or how to order the decomposed elements. When those choices are made according to a codified set of rules (as opposed to your whim of the moment or random selection), that is normalization. The term applies in both directions.
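    For a concrete picture (a small sketch of my own, not graff's code), the core Unicode::Normalize module does both directions; NFC composes and NFD decomposes, and both count as normalization because each follows the codified rules:

        use strict;
        use warnings;
        use Unicode::Normalize qw(NFC NFD);

        # "e" followed by COMBINING ACUTE ACCENT (U+0301)
        my $decomposed = "e\x{0301}";

        my $composed = NFC($decomposed);   # composes to the single codepoint U+00E9
        my $split    = NFD($composed);     # decomposes again, in canonical order

        printf "NFC: %s\n", join ' ', map { sprintf "U+%04X", ord } split //, $composed;
        printf "NFD: %s\n", join ' ', map { sprintf "U+%04X", ord } split //, $split;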

Re^3: [RFC] How to reverse a (text) string
by moritz (Cardinal) on Dec 19, 2007 at 22:54 UTC
    Erm ... if the transformation in one direction is "normalization", then the transformation in the opposite direction is denormalization, isn't it?

    The term denormalization occurs in neither the normalization FAQ nor the Unicode Normalization Forms report.

    Isn't just one of the ways to encode the graphemes supposed to be the "normal form"?

    But which one? There are good reasons for all of these forms.
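    To make "which one?" concrete, here is a small sketch (the example string is my own choice) of how the four forms can disagree about the very same text:

        use strict;
        use warnings;
        use Unicode::Normalize qw(normalize);

        my $s = "\x{FB01}\x{00E9}";    # LATIN SMALL LIGATURE FI, then e-acute

        for my $form (qw(NFC NFD NFKC NFKD)) {
            my $out = normalize($form, $s);
            printf "%-5s %s\n", $form,
                join ' ', map { sprintf "U+%04X", ord } split //, $out;
        }

    The canonical forms keep the ligature as one codepoint, while the compatibility (K) forms replace it with plain "f" and "i"; the e-acute composes or decomposes depending on the form.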

    In either case, thanks to you both for uncovering a (what should I call it?!?) interesting feature of Unicode. I had no idea characters in Unicode can be not only multibyte, but also multi-codepoint. I guess the committee that invented Unicode was too big.

    It's not the committee size, but rather the number of possible graphemes in all the languages of the world. If Unicode had codepoints larger than 2^32, you wouldn't be happy either, would you?

    And I think it is quite a natural approach to divide a grapheme into a base character and a decoration.

    It's sad that it makes programming harder, but if you oversimplify, you lose correctness.

    Sadly, Perl 5's builtins don't work on the grapheme level, only on the codepoint and byte levels. It's one of the many reasons why I'm looking forward to Perl 6...
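    For the curious, a small Perl 5 sketch (my own illustration, not moritz's code) of the usual workaround: the \X regex escape matches one grapheme cluster at a time, so you can reverse per grapheme instead of per codepoint:

        use strict;
        use warnings;

        binmode STDOUT, ':encoding(UTF-8)';

        # "abe" + COMBINING ACUTE ACCENT + "d": five codepoints, four graphemes
        my $str = "abe\x{0301}d";

        my $per_codepoint = reverse $str;                     # the acute ends up attached to the "d"
        my $per_grapheme  = join '', reverse $str =~ /\X/g;   # each grapheme stays intact

        print "per codepoint: $per_codepoint\n";
        print "per grapheme:  $per_grapheme\n";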

      I don't really care how many bytes are needed to STORE a codepoint, but if a codepoint doesn't equal a character (sorry, grapheme), then there is really no point in using that concept. In UTF-8 (and, I believe, in most other ways to encode Unicode) some codepoints are longer than others. Some bytes (or byte sequences) are reserved to mean something like "waitasecond ... we are not done yet, read the next N more bytes". Which means I can't assume a byte is a character; I have to understand those reserved codes and take enough bytes to have the whole "character". With codepoint != character (OK, grapheme, that sounds cooler), this one level of "you have to understand the data to be able to even just say how long it is" is not enough, 'cause there's yet another level on top of that.

      Sorry, that sounds rather ... erm ... interesting.

      Separating a grapheme into a base character and a decoration makes some sense in DTP: in case your font doesn't have that character, you may construct it out of something it does have. (TeX used to do that for Czech accented characters at first.) For anything else it's overcomplication.

      I would rather have the list of Unicode codepoints grow (which I believe it does anyway) and KNOW where a character starts and ends in the stream of bytes.
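      As a rough sketch of those layers (the string and the counts are just an illustration of mine), the same short string has three different lengths depending on which level you count at:

          use strict;
          use warnings;
          use Encode qw(encode);

          # "e" + COMBINING ACUTE ACCENT, then DESERET CAPITAL LETTER LONG I (U+10400)
          my $str = "e\x{0301}\x{10400}";

          my $bytes      = length encode('UTF-8', $str);    # byte level: 7
          my $codepoints = length $str;                     # codepoint level: 3
          my $graphemes  = () = $str =~ /\X/g;              # grapheme level: 2

          print "bytes: $bytes, codepoints: $codepoints, graphemes: $graphemes\n";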

      Jenda
      P.S.: If you overcomplicate, you lose correctness as well, because no one will bother to implement all that stuff correctly.
      Support Denmark!
      Defend the free world!

        I understand your sentiment, but in fairness to the folks creating Unicode, the design process involves a lot of tough calls... The number of languages that use "diacritic" combinations on basic characters is somewhat astonishing, and for the ones that had any sort of pre-existing standard for character encoding, there's the initial problem of the "inertia" from established practice (e.g. collating logic).

        The combinatorial problem might seem relatively trivial for this or that language taken on its own, but it can get pretty cumbersome when each of the many syllabic-based scripts requires a few thousand combined forms, built from less than a hundred basic components. And there are actually quite a few text-processing applications where it really helps to have the graphemes expressed in terms of their individual components, because each component tends to have a stable linguistic function or "meaning" in the structure of the language.

        (I'm thinking about how Korean is handled -- even when you put aside their use of Chinese ideographs, they still use a lot of code points. Applying that approach to Hebrew, Arabic, Hindi, Bengali, Tibetan, Tamil, and several others is, frankly, not an attractive prospect, IMHO.)

        I guess the point is: things will have to be complicated one way or another. If you try to simplify in one area, you end up making things more complicated elsewhere, and vice versa. The existing approach of using combining marks has some nice advantages, and its disadvantages are made a bit less painful by the presence of the Unicode Character Database (this thing is included with perl 5.8 distributions), which lets you look up any code point to see whether it's a "letter" or a "combining mark" (or a "number", or "punctuation", or "bracket", or ...).
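        As a taste of what that lookup looks like in practice (the codepoints here are picked arbitrarily by me), the core Unicode::UCD module exposes the database directly:

            use strict;
            use warnings;
            use Unicode::UCD qw(charinfo);

            # e, combining acute, an Arabic-Indic digit, an ideographic full stop
            for my $cp (0x0065, 0x0301, 0x0669, 0x3002) {
                my $info = charinfo($cp);
                printf "U+%04X  %-35s %s\n", $cp, $info->{name}, $info->{category};
            }

        Combining marks come back with category "Mn", which is exactly what grapheme-aware code keys on.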