Problem with unicode combination diacritics

muba has asked for the wisdom of the Perl Monks concerning the following question:

I have this script:

#!perl

use strict;
use warnings;

use charnames ":full";


my @letters = qw(a b c d e f g h i j k l m n o p q r s t u v w x y z);


my %accents = (
    grave       => chr(0x300),
    acute       => chr(0x301),
    circumflex  => chr(0x302),
    tilde       => chr(0x303),
    breve       => chr(0x306),
    diaeresis   => chr(0x308),
    ring        => chr(0x30A),
    doubleacute => chr(0x30B),
    doublegrave => chr(0x30F),
    cedilla     => chr(0x327),
);

open MEHH, ">unicode.txt" or die "Cannot open unicode.txt: $!";
binmode(MEHH, ":utf8");



foreach my $letter (@letters) {
    my $capital = -1;

    for (1..2) {
        $capital++;
   
        foreach my $accent (keys %accents) {
            my $name = "LATIN " . ($capital ? "CAPITAL " : "SMALL ") . "LETTER " . uc($letter);
            print MEHH chr(charnames::vianame($name)) . "$accents{$accent}  ($name $accent)\n";
        }

        print MEHH "\n";
    }

    print MEHH "\n\n";
}


close MEHH;

As you can see, it has to add certain accents to all letters of the alphabet, both lowercase and uppercase.
These accents are the Combining Diacritical Marks.
According to the unicode website, this means that the accent (or diacritical mark) is applied to the preceding character.
Well, nice, you'd say. But the problem is: it doesn't. Every Combining Diacritical Mark is displayed as a 0.

Why? What do I do wrong? How should I do this?

"2b"||!"2b";$$_="the question"
Besides that, my code is untested unless stated otherwise.
One more: please review the article about regular expressions (do's and don'ts) I'm working on.


Replies are listed 'Best First'.
Re: Problem with unicode combination diacritics by graff (Chancellor) on Apr 15, 2005 at 02:08 UTC
When I ran your script on Mac OS X, in a "Terminal" window with character encoding set to utf8, it displayed some of the lines with the expected single-column accented character (e.g. á ã à and so on), but for others it displayed a digraph: the unaccented character followed by the diacritic in the second column. This is what I would expect, given that only some combinations of letters and diacritics are actually used in various human languages, and it's only the ones that are used that get a "unified glyph" in standard fonts.

If I had a different process for displaying text, particularly one that treated all those letter-plus-accent sequences the same way (e.g. print the letter, backspace, then print the accent without erasing the letter, or detect the letter+accent sequence and print them both before advancing the cursor to the next column), then everything would be the way you want it. Instead, my process only knows how to "coalesce" a letter+accent sequence when it happens to match an accented character that exists in the font. (I guess whatever you're using to display the text doesn't know how to do even that much.)

Bear in mind that while the Unicode standard does set a "canonical ordering" for letters+accents when these are expressed as character sequences, it also says that pre-combined forms should be used in preference to sequences as a rule. (Of course, rules are made to be broken, but this is an area where breaking the rules might not be worth it.)
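If you want the pre-combined forms wherever they exist, one option is to run each letter+accent sequence through NFC normalization from the core Unicode::Normalize module. The following is only a minimal sketch: `compose_report` is a hypothetical helper name made up for this illustration, and it assumes a Perl 5.8 or later with Unicode::Normalize available. It composes a pair and reports how many code points are left afterwards.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFC);

binmode(STDOUT, ":encoding(UTF-8)");

# Hypothetical helper: normalize base letter + combining mark to NFC and
# describe the result. NFC yields a single code point only when Unicode
# defines a precomposed character; otherwise the sequence is left as it is.
sub compose_report {
    my ($base, $mark) = @_;
    my $composed = NFC($base . $mark);
    my @points   = map { sprintf("U+%04X", ord($_)) } split //, $composed;
    return sprintf("%s + U+%04X -> %s (%d code point%s)",
        $base, ord($mark), join(" ", @points),
        scalar(@points), @points == 1 ? "" : "s");
}

print compose_report("a", chr(0x301)), "\n";   # a + COMBINING ACUTE ACCENT
print compose_report("q", chr(0x301)), "\n";   # q + COMBINING ACUTE ACCENT
```

For "a" plus the acute accent this should report the single code point U+00E1, while "q" plus the acute accent stays a two-code-point sequence because no precomposed character exists for it. Those remaining sequences still depend on the display software to stack the accent on the letter.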
Re: Problem with unicode combination diacritics by dave_the_m (Monsignor) on Apr 14, 2005 at 22:57 UTC
"According to the unicode website, this means that the accent (or diacritical mark) is applied to the preceding character. Well, nice, you'd say. But the problem is: it doesn't. Every Combining Diacritical Mark is displayed as a 0."

Perl is, assuming you're using 5.8.x, correctly outputting pairs of Unicode characters (such as an 'A' followed by a \x{300}); how and whether the combining takes place is the job of whatever display device you are using (such as a Unicode-enabled terminal).

Dave.
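To check for yourself that Perl really emits a letter followed by a separate combining mark, you can dump the code points and the UTF-8 bytes instead of trusting what the screen shows. This is a rough sketch, not part of the original script: `code_points` is a hypothetical helper name, and it assumes the core Encode module is available.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: list the code points of a string as "U+XXXX" labels.
sub code_points {
    my ($text) = @_;
    return map { sprintf("U+%04X", ord($_)) } split //, $text;
}

# LATIN CAPITAL LETTER A followed by COMBINING GRAVE ACCENT
my $pair = "A" . chr(0x300);

# Two separate characters as far as Perl is concerned.
print join(" ", code_points($pair)), "\n";    # U+0041 U+0300

# The bytes a ":utf8" filehandle would write for this pair.
my $bytes = encode("UTF-8", $pair);
print join(" ", map { sprintf("%02X", ord($_)) } split //, $bytes), "\n";    # 41 CC 80
```

If unicode.txt contains byte sequences like 41 CC 80, the file is fine and the remaining problem lies with the viewer or its font.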