
Consider the letter "Ä", which has the Unicode name LATIN CAPITAL LETTER A WITH DIAERESIS. You could also write that as two code points: LATIN CAPITAL LETTER A, COMBINING DIAERESIS. That is a base character, in this case "A", and a combining character, here the COMBINING DIAERESIS. Converting one representation into the other is called "Unicode (de)normalization".

So far, so good -- but get rid of the "(de)": it's just "normalization", period.

...
use Unicode::Normalize;
...
mydump $str;
mydump NFKD($str);
mydump scalar reverse NFKD($str);

I think it would make more sense to use "NFKC", which is described in the Unicode::Normalize man page as "compatibility decomposition followed by canonical composition" (emphasis in the original). NFKC is the function that yields "maximally composed" forms for all characters (i.e. the minimum number of separate, non-spacing diacritic marks adjacent to the characters they combine with), and does so according to canonical rules.
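To make the difference concrete, here is a minimal sketch that prints the code points each function produces for the "Ä" example. The codepoints helper is a hypothetical name of my own, not something from Unicode::Normalize or from the original post.

```perl
use strict;
use warnings;
use Unicode::Normalize;   # exports NFC, NFD, NFKC, NFKD by default

# Hypothetical helper: render a string as a list of U+XXXX code points.
sub codepoints { join ' ', map { sprintf 'U+%04X', ord } split //, shift }

my $composed   = "\x{C4}";     # LATIN CAPITAL LETTER A WITH DIAERESIS
my $decomposed = "A\x{308}";   # LATIN CAPITAL LETTER A, COMBINING DIAERESIS

print codepoints(NFKD($composed)),   "\n";   # U+0041 U+0308
print codepoints(NFKC($decomposed)), "\n";   # U+00C4
```

Note that the "K" forms also replace compatibility characters (ligatures, for instance) with their plain equivalents, which may or may not be what you want before reversing.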

Of course, there is still the problem that some languages commonly need characters with diacritics for which Unicode does not (yet) define a single combined character form -- e.g. some Central African languages use "E ACUTE WITH DOT BELOW"; but there is no such Unicode character, so this must be made using "E ACUTE" followed by "COMBINING DOT BELOW", or using "E WITH DOT BELOW" followed by "COMBINING ACUTE ACCENT". (If you apply NFKC to both of those combinations, only one of them will come out as the "canonical" form -- I forget which...)
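A quick way to check which form wins is to run both spellings through NFKC and compare the code points. This is only a sketch; the codepoints helper and the variable names are my own, hypothetical choices.

```perl
use strict;
use warnings;
use Unicode::Normalize;

# Hypothetical helper: render a string as a list of U+XXXX code points.
sub codepoints { join ' ', map { sprintf 'U+%04X', ord } split //, shift }

my $acute_then_dot = "\x{C9}\x{323}";     # E WITH ACUTE, COMBINING DOT BELOW
my $dot_then_acute = "\x{1EB8}\x{301}";   # E WITH DOT BELOW, COMBINING ACUTE ACCENT

# Both spellings normalize to the same sequence: U+1EB8 U+0301
print codepoints(NFKC($_)), "\n" for $acute_then_dot, $dot_then_acute;
```

The dot below has a lower canonical combining class than the acute accent, so it is ordered first and is the mark that gets composed with the base letter.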

I know it can be tough to get one's head around the descriptions provided in the Unicode::Normalize man page, but in essence, NFKD makes sure that all diacritic marks are expanded out to separate code points, which sort of runs counter to what you might want for doing "reverse".

For situations where combining marks are unavoidable (where NFKC cannot eliminate combining marks completely), using NFKD probably would not really help in any way -- where it has any effect at all, it would result in taking more steps (doing more iterations) to solve the basic reversal problem.
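For what it's worth, here is one way the two ideas could be combined. It is a sketch, not the code from the thread: compose with NFKC first, then reverse by extended grapheme cluster (\X) so that any remaining combining marks stay attached to their base characters. The name reverse_graphemes is hypothetical.

```perl
use strict;
use warnings;
use Unicode::Normalize;

# Hypothetical helper: compose as far as possible, then reverse
# cluster by cluster rather than code point by code point.
sub reverse_graphemes {
    my $str = NFKC(shift);
    return join '', reverse $str =~ /\X/g;
}

my $reversed = reverse_graphemes("A\x{308}bc");   # "cb" followed by U+00C4
printf "U+%04X ", ord for split //, $reversed;    # U+0063 U+0062 U+00C4
print "\n";
```

With NFKD in place of NFKC the \X match would still keep each cluster together, but every accented letter would then be a multi-code-point cluster instead of a single character.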

In reply to Re: [RFC] How to reverse a (text) string by graff
in thread [RFC] How to reverse a (text) string by moritz
