in reply to [RFC] How to reverse a (text) string
> Consider the letter "Ä", which has the Unicode name LATIN CAPITAL LETTER A WITH DIAERESIS. You could also write that as two code points: LATIN CAPITAL LETTER A, COMBINING DIAERESIS. That is a base character, in this case "A", and a combining character, here the COMBINING DIAERESIS. Converting one representation into the other is called "Unicode (de)normalization".

So far, so good -- but get rid of the "(de)": it's just "normalization", period.
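To make the two spellings concrete, here is a minimal sketch using Unicode::Normalize. The variable names are only illustrative, and it uses NFKD/NFKC because those are the functions under discussion; for this particular letter the canonical forms NFD/NFC should give the same result.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFKC NFKD);

# Two spellings of the same letter (illustrative variable names).
my $composed   = "\N{U+00C4}";     # LATIN CAPITAL LETTER A WITH DIAERESIS
my $decomposed = "A\N{U+0308}";    # LATIN CAPITAL LETTER A, COMBINING DIAERESIS

# length() counts code points, so the two spellings differ in length.
printf "composed:   %d code point(s)\n", length($composed);      # 1
printf "decomposed: %d code point(s)\n", length($decomposed);    # 2

# Normalization maps each spelling onto the other.
my $to_decomposed = NFKD($composed);
my $to_composed   = NFKC($decomposed);
print "NFKD(composed) matches decomposed\n" if $to_decomposed eq $decomposed;
print "NFKC(decomposed) matches composed\n" if $to_composed   eq $composed;
```

Both directions are plain normalization; which form you get depends only on which function you call.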
> ... use Unicode::Normalize; ...
> mydump $str;
> mydump NFKD($str);
> mydump scalar reverse NFKD($str);

I think it would make more sense to use "NFKC", which is described in the Unicode::Normalize man page as "compatibility decomposition followed by canonical composition" (emphasis in the original). NFKC is the function that yields "maximally composed" forms for all characters (i.e. the minimum number of separate, non-spacing diacritic marks adjacent to the characters they combine with), and does so according to canonical rules.
Of course, there is still the problem that some languages commonly need characters with diacritics for which Unicode does not (yet) define a single combined character form -- e.g. some Central African languages use "E ACUTE WITH DOT BELOW"; but there is no such Unicode character, so this must be made using "E ACUTE" followed by "COMBINING DOT BELOW", or using "E WITH DOT BELOW" followed by "COMBINING ACUTE ACCENT". (If you apply NFKC to both of those combinations, only one of them will come out as the "canonical" form -- I forget which...)
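For anyone who wants to check which spelling survives, a small sketch like the following should settle it. The helper name code_points is made up for this example. Canonical ordering sorts COMBINING DOT BELOW (combining class 220) ahead of COMBINING ACUTE ACCENT (class 230), so both inputs should come out as "E WITH DOT BELOW" followed by the combining acute.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFKC);

# Helper (made-up name): list the code points of a string as U+XXXX.
sub code_points {
    my ($text) = @_;
    return join ' ', map { sprintf 'U+%04X', ord($_) } split //, $text;
}

# The two possible spellings of "E ACUTE WITH DOT BELOW".
my $acute_first = "\N{U+00C9}\N{U+0323}";  # E ACUTE, COMBINING DOT BELOW
my $dot_first   = "\N{U+1EB8}\N{U+0301}";  # E WITH DOT BELOW, COMBINING ACUTE ACCENT

my $nfkc_acute_first = NFKC($acute_first);
my $nfkc_dot_first   = NFKC($dot_first);

print "NFKC(acute first): ", code_points($nfkc_acute_first), "\n";  # U+1EB8 U+0301
print "NFKC(dot first):   ", code_points($nfkc_dot_first),   "\n";  # U+1EB8 U+0301
```

Either way one combining mark remains after normalization, which is the point: NFKC cannot always get a character down to a single code point.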
I know it can be tough to get one's head around the descriptions provided in the Unicode::Normalize man page, but in essence, NFKD makes sure that all diacritic marks are expanded out to separate code points, which sort of runs counter to what you might want for doing "reverse".
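A quick sketch of why the decomposed form works against a plain "reverse" (the two-character test string and the helper name code_points are just assumptions for the demonstration):

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFKC NFKD);

# Helper (made-up name): list the code points of a string as U+XXXX.
sub code_points {
    my ($text) = @_;
    return join ' ', map { sprintf 'U+%04X', ord($_) } split //, $text;
}

my $str = "\N{U+00C4}b";    # A WITH DIAERESIS followed by "b"

# Reverse code point by code point after each normalization.
my $rev_composed   = scalar reverse NFKC($str);
my $rev_decomposed = scalar reverse NFKD($str);

print "reversed NFKC: ", code_points($rev_composed),   "\n";  # U+0062 U+00C4
print "reversed NFKD: ", code_points($rev_decomposed), "\n";  # U+0062 U+0308 U+0041
```

In the NFKD case the COMBINING DIAERESIS now follows the "b", so it would render on the wrong letter; the NFKC string reverses cleanly because the letter is a single code point.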
For situations where combining marks are unavoidable (where NFKC cannot eliminate combining marks completely), using NFKD probably would not really help in any way -- where it has any effect at all, it would result in taking more steps (doing more iterations) to solve the basic reversal problem.
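One way to handle the leftover combining marks, sketched here under the assumption that reversing by grapheme cluster is acceptable, is to normalize with NFKC first and then split on \X, which matches a base character together with its combining marks. The function name reverse_clusters is invented for this example and is not necessarily the approach used elsewhere in the thread.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFKC);

# Sketch (invented name): compose as far as possible, then reverse
# whole grapheme clusters so combining marks stay with their base.
sub reverse_clusters {
    my ($text) = @_;
    my @clusters = NFKC($text) =~ /\X/g;   # one element per cluster
    return join '', reverse @clusters;
}

# "a", then E ACUTE + COMBINING DOT BELOW, then "b".
my $word     = "a\N{U+00C9}\N{U+0323}b";
my $reversed = reverse_clusters($word);

printf "%s\n", join ' ', map { sprintf 'U+%04X', ord($_) } split //, $reversed;
# prints: U+0062 U+1EB8 U+0301 U+0061
```

After NFKC the middle character is two code points rather than three, so there is less for the cluster-matching step to group; NFKD would leave more marks to carry along.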