in reply to [RFC] How to reverse a (text) string

Consider the letter "Ä", which has the Unicode name LATIN CAPITAL LETTER A WITH DIAERESIS. You could also write that as two codepoints: LATIN CAPITAL LETTER A, COMBINING DIAERESIS. That is a base character, in this case "A", and a combining character, here the COMBINING DIAERESIS. Converting one representation into the other is called "Unicode (de)normalization".
So far, so good -- but get rid of the "(de)": it's just "normalization", period.
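
To make the two representations concrete, here is a minimal sketch using Unicode::Normalize (the variable names are my own; NFD and NFC are the canonical decomposition and composition functions):

    use strict;
    use warnings;
    use Unicode::Normalize qw(NFC NFD);

    my $composed   = "\x{00C4}";          # LATIN CAPITAL LETTER A WITH DIAERESIS
    my $decomposed = "\x{0041}\x{0308}";  # LATIN CAPITAL LETTER A + COMBINING DIAERESIS

    print "NFD decomposes\n" if NFD($composed)   eq $decomposed;
    print "NFC composes\n"   if NFC($decomposed) eq $composed;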

    ...
    use Unicode::Normalize;
    ...
    mydump $str;
    mydump NFKD($str);
    mydump scalar reverse NFKD($str);
I think it would make more sense to use "NFKC", which is described in the Unicode::Normalize man page as "compatibility decomposition followed by canonical composition" (emphasis in the original). NFKC is the function that yields "maximally composed" forms for all characters (i.e. the minimum number of separate, non-spacing diacritic marks adjacent to the characters they combine with), and does so according to canonical rules.
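
Here is a small sketch of the difference; the ligature example is my own addition, chosen because LATIN SMALL LIGATURE FI has a compatibility decomposition:

    use strict;
    use warnings;
    use Unicode::Normalize qw(NFKC NFKD);

    # A + COMBINING DIAERESIS, followed by LATIN SMALL LIGATURE FI
    my $s = "A\x{0308}\x{FB01}";

    # NFKC: compatibility decomposition, then canonical composition --
    # the diaeresis composes into U+00C4, the ligature splits into "fi"
    print join(" ", map { sprintf "U+%04X", ord } split //, NFKC($s)), "\n";
    # U+00C4 U+0066 U+0069

    # NFKD: everything stays (or gets pulled) apart
    print join(" ", map { sprintf "U+%04X", ord } split //, NFKD($s)), "\n";
    # U+0041 U+0308 U+0066 U+0069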

Of course, there is still the problem that some languages commonly need characters with diacritics for which Unicode does not (yet) define a single combined character form -- e.g. some Central African languages use "E ACUTE WITH DOT BELOW"; but there is no such Unicode character, so this must be written using "E ACUTE" followed by "COMBINING DOT BELOW", or using "E WITH DOT BELOW" followed by "COMBINING ACUTE ACCENT". (If you apply NFKC to both of those combinations, only one of them will come out as the "canonical" form -- I forget which...)
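
For what it's worth, a quick check (codepoints looked up by me from the standard) suggests the winner is the "E WITH DOT BELOW" spelling: canonical reordering puts the dot below (combining class 220) before the acute (class 230), and composition then yields U+1EB9 followed by the bare combining acute:

    use strict;
    use warnings;
    use Unicode::Normalize qw(NFC);

    my $a = "\x{00E9}\x{0323}";   # E ACUTE + COMBINING DOT BELOW
    my $b = "\x{1EB9}\x{0301}";   # E WITH DOT BELOW + COMBINING ACUTE ACCENT

    print NFC($a) eq NFC($b) ? "same canonical form\n" : "different\n";
    print join(" ", map { sprintf "U+%04X", ord } split //, NFC($a)), "\n";
    # U+1EB9 U+0301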

I know it can be tough to get one's head around the descriptions provided in the Unicode::Normalize man page, but in essence, NFKD makes sure that all diacritic marks are expanded out to separate code points, which sort of runs counter to what you might want for doing "reverse".
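
For example (a small sketch; "café" stands in for any string with a precomposed accented character):

    use strict;
    use warnings;
    use Unicode::Normalize qw(NFKC NFKD);

    my $word = "caf\x{E9}";        # "café", with a precomposed E ACUTE

    # After NFKD the acute is a separate codepoint; a codepoint-wise
    # reverse strands it at the front, detached from its "e"
    printf "%vX\n", scalar reverse NFKD($word);   # 301.65.66.61.63

    # After NFKC the accent travels with its base, because it IS its base
    printf "%vX\n", scalar reverse NFKC($word);   # E9.66.61.63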

For situations where combining marks are unavoidable (where NFKC cannot eliminate combining marks completely), using NFKD probably would not really help in any way -- where it has any effect at all, it would result in taking more steps (doing more iterations) to solve the basic reversal problem.
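
In those cases the usual way out (not something the tutorial does; just a sketch using Perl's \X escape) is to reverse grapheme clusters instead of codepoints:

    use strict;
    use warnings;
    use Unicode::Normalize qw(NFD);

    # \X matches one extended grapheme cluster, so base + combining
    # marks stay together even in fully decomposed text
    my $str = NFD("caf\x{E9}");                # "cafe" + COMBINING ACUTE
    my $rev = join '', reverse $str =~ /\X/g;
    printf "%vX\n", $rev;                      # 65.301.66.61.63 -- acute stays on its "e"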

Re^2: [RFC] How to reverse a (text) string
by moritz (Cardinal) on Dec 19, 2007 at 20:43 UTC
    Thanks for your feedback.

    I know that normally you wouldn't use NFKD for normalization to reverse a string; this was just a trick to show that 1) different normalizations can make reverse behave differently and 2) there are cases where it doesn't work the way you want.

    I'm certainly not going to give an introduction to Unicode normalization here, partly because I don't really grok it, and partly because it's just too big a topic to fit into a tutorial that focuses on something else.

    Anyway, I hope the tutorial is now a bit clearer on why I used NFKD here, and that it's not a normal thing to do.

Re^2: [RFC] How to reverse a (text) string
by Jenda (Abbot) on Dec 19, 2007 at 21:31 UTC

    Erm ... if the transformation in one direction is "normalization", then the transformation in the opposite direction is denormalization, isn't it? Or would that be too logical for Unicode? Isn't just one of the ways to encode the graphemes supposed to be the "normal form"?

    In either case, thanks to you both for uncovering an (what should I call it?!?) interesting feature of Unicode. I had no idea characters in Unicode can be not only multi-byte, but also multi-codepoint. I guess the committee that invented Unicode was too big.

      if the transformation in one direction is "normalization", then the transformation in the opposite direction is denormalization, isn't it?

      Whether you are "composing" or "decomposing" elements of complex characters, you have to make choices about how the operation is done: which elements will be combined into a composed character (as in the "e acute with dot below" example), or how to order the decomposed elements. When those choices are made according to a codified set of rules (as opposed to your whim of the moment or random selection), that is normalization. The term applies in both directions.
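
      One concrete instance of those codified choices (a sketch; the combining-class numbers come from the Unicode character database): when a base carries several marks, NFD sorts them by canonical combining class, so either input order normalizes to the same thing:

          use strict;
          use warnings;
          use Unicode::Normalize qw(NFD);

          my $x = "e\x{0301}\x{0323}";   # acute (class 230), then dot below (class 220)
          my $y = "e\x{0323}\x{0301}";   # dot below first
          print NFD($x) eq NFD($y) ? "same\n" : "different\n";   # same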

      Erm ... if the transformation in one direction is "normalization", then the transformation in the opposite direction is denormalization, isn't it?

      The term "denormalization" occurs in neither the normalization FAQ nor the Unicode Normalization Forms report.

      Isn't just one of the ways to encode the graphemes supposed to be the "normal form"?

      But which one? There are good reasons for all of these forms.

      In either case, thanks to you both for uncovering an (what should I call it?!?) interesting feature of Unicode. I had no idea characters in Unicode can be not only multi-byte, but also multi-codepoint. I guess the committee that invented Unicode was too big.

      It's not the committee size, but rather the number of possible graphemes in all the languages of the world. If Unicode had codepoints larger than 2^32, you wouldn't be happy either, would you?

      And I think it is a quite natural approach to divide a grapheme into a base character and a decoration.

      It's sad that it makes programming harder, but if you oversimplify, you lose correctness.

      Sadly, Perl 5's builtins don't work on the grapheme level, only on the codepoint and byte level. It's one of the many reasons why I'm looking forward to Perl 6...
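
      For instance (a sketch; length() counts codepoints on a character string, while the \X escape matches grapheme clusters):

          use strict;
          use warnings;

          my $str = "e\x{301}";            # one grapheme: "e" + COMBINING ACUTE
          print length $str, "\n";         # 2 -- builtins see codepoints
          my $count = () = $str =~ /\X/g;
          print $count, "\n";              # 1 -- \X sees grapheme clusters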

        I don't really care how many bytes are needed to STORE a codepoint, but if a codepoint doesn't equal a character (sorry, grapheme), then there is really no point in using that concept. In UTF-8 (and, I believe, in most other ways to encode Unicode) some codepoints are longer than others. Some bytes (or byte sequences) are reserved to mean something like "waitasecond ... we are not done yet, read the next N more bytes". Which means I can't assume a byte is a character; I have to understand those reserved codes and take enough bytes to have the whole "character". With codepoint != character (OK, grapheme, that sounds cooler), this one level of "you have to understand the data to be able to even just say how long it is" is not enough, 'cause there's yet another level on top of that.

        Sorry, that sounds rather ... erm ... interesting.
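
        To put numbers on those layers (a sketch; Encode is a core Perl module):

            use strict;
            use warnings;
            use Encode qw(encode);

            my $g = "e\x{301}";                           # one grapheme
            my $bytes      = length encode('UTF-8', $g);  # 3 -- byte level
            my $codepoints = length $g;                   # 2 -- codepoint level
            my $graphemes  = () = $g =~ /\X/g;            # 1 -- grapheme level
            print "$bytes bytes, $codepoints codepoints, $graphemes grapheme\n";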

        Separating a grapheme into a base character and a decoration makes some sense in DTP. In case your font doesn't have that character, you may construct it out of something it does have. (TeX used to do that for Czech accented characters at first.) For anything else it's overcomplication.

        I would rather have the list of Unicode codepoints grow (which I believe it does anyway) and KNOW where a character starts and ends in the stream of bytes.

        Jenda
        P.S.: If you overcomplicate, you lose correctness as well. Because no one will bother to implement all that stuff correctly.