(c3 a4 is the utf8 codepoint of ä

No. Codepoints are numbers. c3 a4 is the UTF-8 representation of codepoint U+00E4:

$ perl -le'binmode STDOUT, ":utf8"; print "\x{00E4}";'|od -c
0000000 303 244  \n
0000003

Or, in a more legible form:

$ perl -CO -le'use charnames ":full"; print "\N{LATIN SMALL LETTER A WITH DIAERESIS}";'|od -c
0000000 303 244  \n
0000003

This shows that the internal representation is in iso

You should not assume anything about the internal representation of Perl strings. It may change in the future.
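To illustrate why the internal representation should not matter to your code, here is a minimal sketch using the built-in utf8::upgrade and utf8::is_utf8 functions. The variable names are mine, chosen for illustration. Both strings hold the same single character; only the storage differs, and Perl-level operations such as eq and length treat them the same:

```perl
use strict;
use warnings;

# Two copies of the same one-character string, U+00E4.
my $narrow = "\x{e4}";
my $wide   = "\x{e4}";

# utf8::upgrade changes only how perl stores the string internally;
# the characters themselves stay the same.
utf8::upgrade($wide);

printf "narrow stored as UTF-8 internally: %s\n", utf8::is_utf8($narrow) ? "yes" : "no";
printf "wide stored as UTF-8 internally:   %s\n", utf8::is_utf8($wide)   ? "yes" : "no";

# Character-level operations do not see the difference.
printf "strings compare equal: %s\n", $narrow eq $wide ? "yes" : "no";
printf "lengths: %d and %d\n", length $narrow, length $wide;
```

If your code only ever decodes on input and encodes on output, it should never need to call utf8::is_utf8 at all; it is used here purely to peek behind the curtain.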

It surprises me that no one has suggested Encode yet. With it, you can decode strings to Perl's internal format, mangle them at will and encode them back when printing them out:

$ perl |od -c
use Encode;
my $c = decode "latin1", "\xe4";
$c = uc $c;
$c = chr (1 + ord $c); ## further mangling
print encode "latin1", $c;
__END__
0000000 305
0000001

$ perl |od -c
use Encode;
my $c = decode "latin1", "\xe4";
$c = uc $c;
$c = chr (1 + ord $c);
print encode "utf8", $c; ## <-- change here
__END__
0000000 303 205
0000002

Furthermore on utf8 machines -CS should be enabled by default

I thought so too, but it ended up being a bad idea. Yes, it's great for UTF-8 encoded text files, but what if you're working with a binary? Instead of using binmode :raw on binaries, I chose to drop -C and use binmode :utf8 on UTF-8 text files, like the rest of the world.
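A rough sketch of that per-handle approach, without -C: the text handle gets an encoding layer and the binary handle gets :raw. Note two assumptions of mine: I use the stricter :encoding(UTF-8) layer where the text above says :utf8, and the scratch file from File::Temp is only there to make the example self-contained.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical scratch file, used only for this demonstration.
my ($tmp, $path) = tempfile(UNLINK => 1);
close $tmp;

# Text output: attach an encoding layer to this one handle only.
open my $out, '>:encoding(UTF-8)', $path or die "open for write: $!";
print {$out} "\x{e4}\n";
close $out or die "close: $!";

# Binary input: :raw hands back the bytes exactly as stored.
open my $raw, '<:raw', $path or die "open raw: $!";
my $bytes = do { local $/; <$raw> };
close $raw;
printf "bytes on disk: %s\n",
    join ' ', map { sprintf '%02x', ord } split //, $bytes;

# Text input: the same file decoded back into characters.
open my $in, '<:encoding(UTF-8)', $path or die "open for read: $!";
my $line = <$in>;
close $in;
chomp $line;
printf "characters read: %d (U+%04X)\n", length $line, ord $line;
```

The point of the design is that each handle declares what it carries, so a binary file opened elsewhere in the same program is never silently decoded.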

And, if you've not noticed yet, there's no mention of use utf8 in this post (well, almost ;^)). AIUI, utf8 serves a totally different purpose, namely:

use utf8;
my $á = 42;
print $á, "\n";
__END__
42

--
David Serrano


In reply to Re^2: bug in utf8 handling? by Hue-Bond
in thread bug in utf8 handling? by jethro
