I haven't found a good, in depth tutorial on string reversal here on perlmonks, so I tried to write one. Please tell me what you think about it, and how it could be improved.

You have written some Perl scripts already, and when somebody asks you how to reverse a string, you'll answer: "That's easy, just call reverse in scalar context".

And of course, you're right - if you're only considering ASCII chars.

But suppose you have an UTF-8 environment:

#!/usr/bin/perl use strict; use warnings; print scalar reverse "\noäu";

The output consists of a "u", two garbage characters, and a newline.

The reason is that "ä", like many other chars, is represented by several bytes in UTF-8, here as 0xC3 0xA4. reverse Works on bytes, so it will produce 0xA4< 0xC3. And that is not legal UTF-8, so the output contains two bytes of garbage.

You can solve this problem by decoding the text strings (read perluniintro and perlunicode for more information):

#!/usr/bin/perl use strict; use warnings; use utf8; binmode STDOUT, ':utf8'; print scalar reverse "\noäu"; __END__ uäo

The use utf8; takes care that every string literal in the script is treated as a text string, so reverse (and other functions like uc) will work on codepoint level.

While this example worked, it could just as well fail.

The reason is that there are multiple ways to encode some characters.

Consider the letter "Ä", which has the Unicode name LATIN CAPITAL LETTER A WITH DIAERESIS. You could also write that as two Codepoints: LATIN CAPITAL LETTER A, COMBINING DIAERESIS. That is a base character, in this case "A", and a combining character, here the COMBINING DARESIS.

Converting one representation into the other is called "Unicode normalization".

Bad luck, in our case, reverse doesn't work for the normalized form:

#!/usr/bin/perl use strict; use warnings; use utf8; use Unicode::Normalize; use charnames ':full'; my $str = 'Ä'; sub mydump { print map { '\N['. charnames::viacode(ord $_) . ']' } split m//, $_[0]; print "\n"; } mydump $str; mydump NFKD($str); mydump scalar reverse NFKD($str); binmode STDOUT, ':utf8'; my $tmp = "\nÄO"; print scalar reverse NFKD($tmp); __END__ \N[LATIN CAPITAL LETTER A WITH DIAERESIS] \N[LATIN CAPITAL LETTER A]\N[COMBINING DIAERESIS] \N[COMBINING DIAERESIS]\N[LATIN CAPITAL LETTER A] ÖA

You can see that reversing a string moves the combining character(s) to the front, thus they are applied to the wrong base character; "ÄO" reversed becomes "ÖA".

(You wouldn't normalize with NFKD here under normal circumstances, in this example it is done to demonstrate the problem that can arise from such strings).

It seems the problem could easily be solved by not doing the normalization in the first place, and indeed that works in this example. But there are Unicode graphemes that can't be expressed with a single Codepoint, and if one of your users enters such a grapheme, your application won't work correctly.

So we need a "real" solution. Since perl doesn't work with graphemes, we need a CPAN module that does:

#!/usr/bin/perl use strict; use warnings; use utf8; use Unicode::Normalize; use charnames ':full'; use String::Multibyte; my $str = NFKD "ÄO"; sub mydump { print map { '\N['. charnames::viacode(ord $_) . ']' } split m//, $_[0]; print "\n"; } my $u = String::Multibyte->new('Grapheme'); mydump $str; mydump $u->strrev($str); binmode STDOUT, ':utf8'; print $u->strrev($str), "\n"; __END__ \N[LATIN CAPITAL LETTER A]\N[COMBINING DIAERESIS]\N[LATIN CAPITAL LETT +ER O] \N[LATIN CAPITAL LETTER O]\N[LATIN CAPITAL LETTER A]\N[COMBINING DIAER +ESIS] OÄ

The String::Multibyte::Grapheme module helps you with reversing the string without tearing the graphemes apart.

(It can also count the number of graphemes, generate substrings with grapheme semantics and more; see String::Multibyte.)

(Updates: typos fixed, holli++, jdporter++, jrsimmon++, improvements suggested by graff++).

Replies are listed 'Best First'.
Re: [RFC] How to reverse a (text) string
by graff (Chancellor) on Dec 19, 2007 at 20:28 UTC
    Consider the letter "Ä", which has the Unicode name LATIN CAPITAL LETTER A WITH DIAERESIS. You could also write that as two Codepoints: LATIN CAPITAL LETTER A, COMBINING DIAERESIS. That is a base character, in this case "A", and a combining character, here the COMBINING DARESIS. Converting one representation into the other is called "Unicode (de)normalization".
    So far, so good -- but get rid of the "(de)": it's just "normalization", period.

    ... use Unicode::Normalize; ... mydump $str; mydump NFKD($str); mydump scalar reverse NFKD($str);
    I think it would make more sense to use "NFKC", which is described in the Unicode::Normalize man page as "compatibility decomposition followed by canonical composition" (emphasis in the original). NFKC is the function that yields "maximally composed" forms for all characters (i.e. the minimum number of separate, non-spacing diacritic marks adjacent to the characters they combine with), and does so according to canonical rules.

    Of course, there is still the problem that some languages commonly need characters with diacritics for which Unicode does not (yet) define a single combined character form -- e.g. some Central African languages use "E ACUTE WITH DOT BELOW"; but there is no such Unicode character, so this must be made using "E ACUTE" followed by "COMBINING DOT BELOW", or using "E WITH DOT BELOW" followed by "COMBINING ACUTE ACCENT". (If you apply NFKC to both of those combinations, only one of them will come out as the "canonical" form -- I forget which...)

    I know it can be tough to get one's head around the descriptions provided in the Unicode::Normalize man page, but in essence, NFKD makes sure that all diacritic marks are expanded out to separate code points, which sort of runs counter to what you might want for doing "reverse".

    For situations where combining marks are unavoidable (where NFKC cannot eliminate combining marks completely), using NFKD probably would not really help in any way -- where it has any effect at all, it would result in taking more steps (doing more iterations) to solve the basic reversal problem.

      Thanks for your feedback.

      I know that normally you wouldn't use NFKD for normalization to reverse a string, this was just a trick to show that 1) different normalizations can behave make reverse differently and 2) there are cases where it doesn't work the way you want.

      Surely I won't give an introduction to Unicode normalization here, partly because I don't really grok it, partly because it's just too big to fit in a tutorial that focuses on something else.

      Anyway, I hope the tutorial is now a bit clearer on why I used NFKD here, and that it's a normal thing to do.

      Erm ... if the transformation in one direction is "normalization", then the transformation in the oposite direction is denormalization, isn't it? Or would that be too logical for Unicode. Isn't just one of the ways to encode the graphemes supposed to be the "normal form"?

      In either case thanks both for uncovering an (what should I call it?!?) interesting feature of Unicode. I had no idea characters in Unicode can be not only multibyte, but also mutlicodepoint. I guess the commitee that invented Unicode was too big.

        if the transformation in one direction is "normalization", then the transformation in the oposite direction is denormalization, isn't it?

        Whether you are "composing" or "decomposing" elements of complex characters, you have to make choices about how the operation is done: which elements will be combined into a composed character (as in the "e acute with dot below" example), or how to order the decomposed elements. When those choices are made according to a codified set of rules (as opposed to your whim of the moment or random selection), that is normalization. The term applies in both directions.

        Erm ... if the transformation in one direction is "normalization", then the transformation in the oposite direction is denormalization, isn't it?

        The term dernormalization doesn't occur in either the normalization FAQ nor in the Unicode Normalization Forms report.

        Isn't just one of the ways to encode the graphemes supposed to be the "normal form"?

        But which one? There are good reasons for all of these forms.

        In either case thanks both for uncovering an (what should I call it?!?) interesting feature of Unicode. I had no idea characters in Unicode can be not only multibyte, but also mutlicodepoint. I guess the commitee that invented Unicode was too big.

        It's not the committee size, but rather the number of possible graphemes in all languages of the world. If Unicode had Codepoints larger than 2^32 you wouldn't be happy either, would you?

        And I think it is a quite natural approach to divide a grapheme into a base character and a decoration.

        It's sad that it makes programming harder, but if you oversimplify, you lose correctness.

        Sadly Perl 5's builtins don't work on the grapheme level, only Codepoint and Byte level. It's one of the many reasons why I'm looking forward to Perl 6...

Re: [RFC] How to reverse a (text) string
by moritz (Cardinal) on Jul 23, 2008 at 10:02 UTC
    As an update I now released Perl6::Str on CPAN. It allows you to work with strings on the grapheme level, like this:
    use Perl6::Str; use charnames qw(:full); my $str = Perl6::Str->new("A\N{COMBINING DIAERESIS}O"); print $str->reverse; # will print OÄ, as is DWIMmy.