in reply to Problem with unicode combination diacritics

When I ran your script on Mac OS X, in a "Terminal" window with the character encoding set to UTF-8, it displayed some of the lines with the expected single-column accented character (e.g. á ã à and so on), but for others it displayed a two-column pair -- the unaccented character followed by the diacritic in the second column.

This is what I would expect, given that only some combinations of letters and diacritics are actually used in various human languages, and it's only the ones that are used that get a "unified glyph" in standard fonts.
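To see which letter+accent pairs have a pre-combined code point at all, something like the following rough sketch might help. It is in Python only for illustration and is not the script from the original post; `has_precomposed` is a name I made up. It also rests on an assumption: that "NFC can fold the pair into one code point" is a reasonable stand-in for "the font has a unified glyph", which will not be true for every font.

```python
import unicodedata

def has_precomposed(base, mark):
    """Return True if NFC folds base + combining mark into a single code point.

    Assumption: a single NFC code point is used here as a proxy for
    "a unified glyph probably exists"; actual font coverage may differ.
    """
    return len(unicodedata.normalize("NFC", base + mark)) == 1

ACUTE = "\u0301"  # COMBINING ACUTE ACCENT
TILDE = "\u0303"  # COMBINING TILDE

# a+acute and a+tilde are used in real orthographies; q+acute is not.
for base, mark in [("a", ACUTE), ("a", TILDE), ("q", ACUTE)]:
    print(f"{base} + U+{ord(mark):04X}: precomposed = {has_precomposed(base, mark)}")
```

Pairs that report no pre-combined form are the ones I would expect to show up as two columns in a terminal that behaves like mine.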

If I had a different process for displaying text -- particularly, one that treated all those letter-plus-accent sequences the same way (e.g. print the letter, backspace, then print the accent without erasing the letter, or detect the letter+accent sequence and print them both before advancing the cursor to the next column), then everything would be the way you want it. Instead, my process only knows how to "coalesce" a letter+accent sequence when it happens to match an accented character that exists in the font. (I guess whatever you're using to display the text, it doesn't know how to do even that much.)
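The "detect the letter+accent sequence and print them both before advancing" idea can be sketched roughly as below (Python, for illustration only; `clusters` and `columns` are hypothetical helper names, not part of any terminal or library). It simply attaches every combining mark to the character before it, and it deliberately ignores harder cases such as double-width characters and zero-width joiners.

```python
import unicodedata

def clusters(text):
    """Group each base character with the combining marks that follow it."""
    out = []
    for ch in text:
        # A nonzero canonical combining class means "attach to previous char".
        if out and unicodedata.combining(ch) != 0:
            out[-1] += ch
        else:
            out.append(ch)
    return out

def columns(text):
    """Columns needed if every cluster is drawn in a single cell (assumed)."""
    return len(clusters(text))

sample = "a\u0301q\u0301z"   # a+acute, q+acute, z
print(clusters(sample))
print(columns(sample))
```

A display process working this way would treat a+acute and q+acute identically, whether or not a pre-combined character exists for the pair.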

Bear in mind that while the Unicode standard does set a "canonical ordering" for letters+accents when these are expressed as character sequences, common practice is to use pre-combined forms in preference to sequences wherever they exist; that is what normalization form NFC produces, and it is the form the W3C recommends for web content. (Of course, rules are made to be broken, but this is an area where breaking the rules might not be worth it.)
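As a small illustration of both points (again in Python, as a sketch rather than a recommendation for your script; `canonical` and `precombine` are names I made up): NFD puts combining marks into canonical order, and NFC replaces a sequence with the pre-combined character where one exists. Perl's core Unicode::Normalize module provides NFC and NFD functions that should let you do the equivalent.

```python
import unicodedata

ACUTE = "\u0301"      # COMBINING ACUTE ACCENT, combining class 230 (above)
DOT_BELOW = "\u0323"  # COMBINING DOT BELOW, combining class 220 (below)

def canonical(text):
    """Decompose and sort combining marks into canonical order (NFD)."""
    return unicodedata.normalize("NFD", text)

def precombine(text):
    """Replace sequences with pre-combined characters where they exist (NFC)."""
    return unicodedata.normalize("NFC", text)

one = "a" + ACUTE + DOT_BELOW   # marks typed in one order
two = "a" + DOT_BELOW + ACUTE   # same marks, other order

# Both spellings end up as the same sequence after canonical ordering.
print(canonical(one) == canonical(two))

# A sequence with a pre-combined equivalent collapses to one code point.
print(len(precombine("a" + ACUTE)))
```

Running text through `precombine` before printing would give a terminal like mine the best chance of finding a unified glyph, though sequences with no pre-combined form are left as they are.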
