First, what is wrong with the original output? I see that the accented characters did not get their case converted. Is that the only problem? I see the accented letters in your output just fine, so I think that UTF-8 input and output is working OK.

The Unicode character is code point U+00E9, "Latin small letter e with acute". A code point is an integer, in the abstract mathematical sense, and for this character the value is hexadecimal E9 (233 decimal). If you split the string into characters and print the ordinal of each one, E9 is what you will get.
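A minimal sketch of that split-and-print step. The sample word and variable names are mine, not from your script, and the character is written as an escape so the result does not depend on how the source file is encoded:

```perl
use strict;
use warnings;

# "\x{E9}" is U+00E9 written as an escape (illustrative sample word).
my $word = "caf\x{E9}";

# Split into characters and format each character's ordinal as hex.
my @ordinals = map { sprintf '%02X', ord($_) } split //, $word;

print "@ordinals\n";    # 63 61 66 E9
```

The last ordinal is E9 no matter how Perl happens to store the string internally.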

When you encode the character as a sequence of bytes using UTF-8, the character U+00E9 is encoded as two bytes, C3 A9. But Perl hides this from you. If the string is holding characters (as opposed to holding bytes), the fact that those two bytes are in memory is an implementation detail: splitting into characters keeps both bytes together as one character, and ord knows how to turn that character into an integer.
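To see the character/byte distinction directly, here is a small sketch using the core Encode module. The variable names are only for illustration:

```perl
use strict;
use warnings;
use Encode qw(encode decode);

my $char  = "\x{E9}";                 # one character, U+00E9
my $bytes = encode('UTF-8', $char);   # the same character as UTF-8 octets

printf "characters: %d\n", length($char);    # characters: 1
printf "bytes:      %d\n", length($bytes);   # bytes:      2

# Show each octet of the encoded form in hex.
my $hex = join ' ', map { sprintf '%02X', ord($_) } split //, $bytes;
print "$hex\n";                               # C3 A9

# Decoding the octets gives back the single character.
my $round_trip = decode('UTF-8', $bytes);
printf "%02X\n", ord($round_trip);            # E9
```

The same ord call reports E9 for the character string and C3, A9 for the byte string, which is why it matters which of the two your program is holding.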

Actually, the Perl docs confuse the meaning of character and code point. The above doesn't consider that a single grapheme might be composed of several code points, such as a base letter A followed by a modifier "acute accent above".

Now your new output: &#65533; is the HTML encoding for U+FFFD, the "Replacement Character" (�), normally shown as a diamond with a question mark inside. This suggests that with UTF-8 enabled, which turned on the Unicode version of uc and lc, it did not know how to convert é and so used this as the error replacement. I don't know why you are missing the final character in two of the lines; perhaps a cut-and-paste problem?
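On the grapheme point, a short sketch of one visible letter built from two code points. It uses e plus U+0301 (combining acute accent) to match the é in this thread, and the core Unicode::Normalize module. The names are illustrative:

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFC);

my $composed   = "\x{E9}";      # é as a single code point
my $decomposed = "e\x{301}";    # e followed by COMBINING ACUTE ACCENT

# length counts code points, not what a reader sees as one letter.
printf "code points: %d\n", length($decomposed);     # code points: 2

# \X matches one extended grapheme cluster at a time.
my @clusters = $decomposed =~ /(\X)/g;
printf "graphemes:   %d\n", scalar @clusters;        # graphemes:   1

# The two spellings compare unequal until they are normalized.
my $same_raw = $decomposed eq $composed      ? 1 : 0;
my $same_nfc = NFC($decomposed) eq $composed ? 1 : 0;
print "equal raw: $same_raw, equal after NFC: $same_nfc\n";
```

If your data might contain decomposed sequences, normalizing to NFC before comparing or counting is one way to get consistent answers.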

Capitalization, in general, is language specific. I agree that a generic routine should convert é to É. Only if it knows you are writing French, where capital letters don't have their marks shown, would it map é to E. I don't know enough about the implementation to tell you why the function failed when use utf8 was used.
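For comparison, here is a sketch of the generic é to É conversion working when the input is explicitly decoded first. It assumes the data arrives as UTF-8 bytes, which I can't confirm for your script, so take it as one way to get the expected result rather than a diagnosis of the failure:

```perl
use strict;
use warnings;
use Encode qw(decode encode);

my $bytes = "\xC3\xA9";                 # UTF-8 octets for é, e.g. as read from a file
my $chars = decode('UTF-8', $bytes);    # character string holding U+00E9

my $upper = uc $chars;                  # Unicode case mapping on characters
printf "U+%04X\n", ord($upper);         # U+00C9

# Encode again before writing the result out as UTF-8.
my $out = encode('UTF-8', $upper);
my $out_hex = join ' ', map { sprintf '%02X', ord($_) } split //, $out;
print "$out_hex\n";                     # C3 89
```

The pattern is decode on input, work on characters, encode on output.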

—John


In reply to Re^3: Help with Accented Characters by John M. Dlugosz
in thread Help with Accented Characters by shawnhcorey
