Re^3: Help with Accented Characters

First, what is wrong with the original output? I see that the accented characters did not get their case converted. Is that the only problem? I see the accented letters in your output just fine, so I think that UTF-8 input and output is working OK.

The character é is Unicode code point U+00E9, "Latin small letter e with acute". A code point is just an integer, in the abstract mathematical sense; this one is E9 in hex (233 in decimal). If you split the string into characters and print the ordinal of each one, E9 is what you will get.
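A minimal sketch of that split-and-print check. The sample string is my own, and the é is written as the escape "\x{E9}" so the result does not depend on how the script file itself is encoded:

```perl
use strict;
use warnings;

# "\x{E9}" is U+00E9; the sample word is only an illustration.
my $str = "caf\x{E9}";

# Split into characters and format each ordinal as hex.
my @ordinals = map { sprintf '%X', ord($_) } split //, $str;

print "@ordinals\n";   # 63 61 66 E9
```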

When you encode the character as a sequence of bytes using UTF-8, U+00E9 becomes two bytes, C3 A9. But Perl hides this from you. If the string is holding characters (as opposed to holding bytes), the two bytes may well be what sits in memory as an implementation detail, but splitting into characters keeps both bytes together as one character, and ord knows how to turn that character back into the integer E9.
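To see the difference between the character and its UTF-8 bytes, here is a small sketch using the core Encode module. The variable names are mine; the point is only that one character becomes two bytes once you encode it explicitly:

```perl
use strict;
use warnings;
use Encode qw(encode);

my $char  = "\x{E9}";                 # one character, U+00E9
my $bytes = encode('UTF-8', $char);   # the same character as UTF-8 bytes

my $char_len = length $char;          # counts characters
my $byte_len = length $bytes;         # counts bytes

# Hex dump of the encoded form.
my $hex = join ' ', map { sprintf '%02X', ord($_) } split //, $bytes;

print "$char_len character, $byte_len bytes: $hex\n";   # 1 character, 2 bytes: C3 A9
```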

Actually, the Perl docs confuse the meaning of character and code point. The above doesn't consider that a single grapheme might be composed of several code points, such as a base letter A followed by a combining "acute accent above".

Now your new output: &#65533; is the HTML encoding for U+FFFD, "Replacement Character", normally shown as a diamond with a question mark inside. As far as I can tell, this means that with UTF-8 enabled, which turned on the Unicode version of uc and lc, it did not know how to convert é and so used this as the error replacement. I don't know why you are missing the final character in two of the lines; perhaps a cut and paste problem?
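Both points can be checked from Perl. This is only a sketch: the decomposed string (e followed by U+0301) is my own example rather than anything from your data, and it uses \X in a regex to count graphemes plus the core Unicode::Normalize and charnames modules. The Replacement Character is U+FFFD, which is 65533 in decimal:

```perl
use strict;
use warnings;
use charnames ();
use Unicode::Normalize qw(NFC);

my $precomposed = "\x{E9}";     # single code point
my $decomposed  = "e\x{301}";   # e + COMBINING ACUTE ACCENT

my $code_points = length $decomposed;            # counts code points
my $graphemes   = () = $decomposed =~ /\X/g;     # counts grapheme clusters

# Normalizing to NFC composes the pair into the single code point.
my $same = NFC($decomposed) eq $precomposed ? "yes" : "no";

print "$code_points code points, $graphemes grapheme, NFC equal: $same\n";

# The numeric entity 65533 is U+FFFD.
my $hex  = sprintf '%X', 65533;
my $name = charnames::viacode(0xFFFD);
print "U+$hex is $name\n";   # U+FFFD is REPLACEMENT CHARACTER
```

So length counts code points, not graphemes, which is the distinction the docs tend to blur.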

Capitalization, in general, is language specific. I agree that a generic routine should convert é to É. Only if it knew you were writing French in a style where capital letters don't have their marks shown would it map é to E. I don't know enough about the implementation to tell you why the function failed when use utf8 was in effect.
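For what it's worth, here is a sketch of the generic é to É mapping. It assumes Perl 5.12 or later, because I enable the unicode_strings feature so that uc applies Unicode rules even to a string whose code points are all below 256; that is my assumption about a setup that works, not a diagnosis of your script:

```perl
use strict;
use warnings;
use feature 'unicode_strings';   # assumption: Perl 5.12 or later

my $lower = "\x{E9}";   # é, written as an escape to avoid source-encoding issues
my $upper = uc $lower;  # generic, language-independent uppercase mapping

my $result = sprintf "U+%04X -> U+%04X", ord($lower), ord($upper);
print "$result\n";      # U+00E9 -> U+00C9
```

U+00C9 is É, with the accent kept; nothing here knows or cares what language the text is in.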

—John
