(c3 a4 is the utf8 codepoint of ä

No. Codepoints are numbers. c3 a4 is the UTF8 representation of codepoint 00E4:

$ perl -le'binmode STDOUT, ":utf8"; print "\x{00E4}";'|od -c
0000000 303 244  \n
0000003

Or, in a more legible form:

$ perl -CO -le'use charnames ":full"; print "\N{LATIN SMALL LETTER A WITH DIAERESIS}";'|od -c
0000000 303 244  \n
0000003
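To put the number and its encoded bytes side by side, here is a minimal sketch. The helper name utf8_hex is made up for this illustration; encode comes from the core Encode module.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: return the UTF-8 bytes of one codepoint
# as a space-separated hex string.
sub utf8_hex {
    my ($codepoint) = @_;
    my $bytes = encode('UTF-8', chr $codepoint);   # characters -> bytes
    return join ' ', map { sprintf '%02x', ord } split //, $bytes;
}

# The codepoint is the number 0xE4; "c3 a4" is only how UTF-8 writes it down.
printf "U+%04X encodes as %s\n", 0xE4, utf8_hex(0xE4);
```

This prints "U+00E4 encodes as c3 a4": one codepoint, two bytes.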

This shows that the internal representation is in iso

You should not assume anything about the internal representation of perl strings. It may change in the future.
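A small sketch of why the internals shouldn't matter to your code: utf8::upgrade changes how perl stores a string without changing the string itself. The variable names are just for illustration, and peeking at the flag with utf8::is_utf8 is done here only to show that it differs, not as something to rely on.

```perl
use strict;
use warnings;

my $native   = "\xE4";      # one character, U+00E4
my $upgraded = "\xE4";
utf8::upgrade($upgraded);   # changes internal storage only, not the value

# At the Perl level the two strings are indistinguishable.
printf "same string: %s\n", $native eq $upgraded ? 'yes' : 'no';
printf "same length: %s\n", length($native) == length($upgraded) ? 'yes' : 'no';

# Only the internal flag differs.
printf "UTF8 flag: native=%d upgraded=%d\n",
    (utf8::is_utf8($native)   ? 1 : 0),
    (utf8::is_utf8($upgraded) ? 1 : 0);
```

Both comparisons say "yes" even though the flags differ, which is the point: treat strings as sequences of characters and leave the storage to perl.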

It surprises me that no one has suggested Encode yet. With it, you can decode strings to Perl's internal format, mangle them at will and encode them back when printing them out:

$ perl |od -c
use Encode;
my $c = decode "latin1", "\xe4";
$c = uc $c;
$c = chr (1 + ord $c);     ## further mangling
print encode "latin1", $c;
__END__
0000000 305
0000001
$ perl |od -c
use Encode;
my $c = decode "latin1", "\xe4";
$c = uc $c;
$c = chr (1 + ord $c);
print encode "utf8", $c;   ## <-- change here
__END__
0000000 303 205
0000002

Furthermore on utf8 machines -CS should be enabled by default

I thought that too, but it ended up being a bad idea. Yes, it's great for UTF-8 encoded text files, but what if you're working with a binary? Instead of using binmode :raw on binaries, I chose to drop -C and use binmode :utf8 on UTF-8 text files, like the rest of the world.
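A rough sketch of the per-filehandle approach, under a couple of assumptions: the helpers write_text and slurp are invented for this example, and I use the :encoding(UTF-8) layer, which is the stricter, validating cousin of :utf8, rather than :utf8 itself.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical helper: write $text to $path through a UTF-8 encoding layer.
sub write_text {
    my ($path, $text) = @_;
    open my $fh, '>:encoding(UTF-8)', $path or die "Cannot write $path: $!";
    print {$fh} $text;
    close $fh or die "Cannot close $path: $!";
    return;
}

# Hypothetical helper: read the whole file using the given layer,
# e.g. ':encoding(UTF-8)' for text or ':raw' for binary data.
sub slurp {
    my ($path, $layer) = @_;
    open my $fh, "<$layer", $path or die "Cannot read $path: $!";
    local $/;                       # slurp mode
    my $data = <$fh>;
    close $fh;
    return $data;
}

my (undef, $path) = tempfile(UNLINK => 1);
write_text($path, "\x{E4}");

# Same file, two views: decoded characters versus the bytes on disk.
printf "as text:   %d character(s)\n", length(slurp($path, ':encoding(UTF-8)'));
printf "as binary: %d byte(s)\n",      length(slurp($path, ':raw'));
```

The text view reports 1 character and the binary view 2 bytes. Each filehandle states its own intent, so nothing global like -C gets in the way of binaries.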

And, if you've not noticed yet, there's no mention of use utf8 in this post (well, almost ;^)). AIUI, utf8 serves a totally different purpose, namely telling perl that the source code itself is written in UTF-8:

use utf8;
my $А = 42;
print $А, "\n";
__END__
42

--
David Serrano

In reply to Re^2: bug in utf8 handling? by Hue-Bond
in thread bug in utf8 handling? by jethro
