(c3 a4 is the utf8 codepoint of ä
No. Code points are numbers. c3 a4 is the UTF-8 encoding of code point U+00E4:
    $ perl -le'binmode STDOUT, ":utf8"; print "\x{00E4}";'|od -c
    0000000 303 244 \n
    0000003
Or, in a more legible form:
    $ perl -CO -le'use charnames ":full"; print "\N{LATIN SMALL LETTER A WITH DIAERESIS}";'|od -c
    0000000 303 244 \n
    0000003
This shows that the internal representation is in iso
You should not assume anything about the internal representation of perl strings. It may change in the future.
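To illustrate why the internal representation is not something to rely on, here is a minimal sketch. It uses the built-in utf8::upgrade and utf8::is_utf8 functions purely as a way to peek at the storage; the variable names are only for illustration. The same one-character string can be stored in two different ways and still be the same string as far as Perl code is concerned:

```perl
use strict;
use warnings;

my $native   = "\xe4";       # one character, code point 0xE4
my $upgraded = $native;
utf8::upgrade($upgraded);    # changes how the string is stored, not its value

# Peek at the internal flag, only to show that the storage differs.
my $flag_before = utf8::is_utf8($native)   ? 1 : 0;
my $flag_after  = utf8::is_utf8($upgraded) ? 1 : 0;

print "equal: ", ($native eq $upgraded ? "yes" : "no"), "\n";
print "lengths: ", length($native), " and ", length($upgraded), "\n";
print "internal flag: $flag_before and $flag_after\n";
```

The two strings compare equal and both have length 1, even though the internal flag differs. That is the point: code that depends on the flag depends on an implementation detail.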
It surprises me that no one has suggested Encode yet. With it, you can decode strings into Perl's internal format, mangle them at will, and encode them back when printing them out:
    $ perl |od -c
    use Encode;
    my $c = decode "latin1", "\xe4";
    $c = uc $c;
    $c = chr (1 + ord $c); ## further mangling
    print encode "latin1", $c;
    __END__
    0000000 305
    0000001

    $ perl |od -c
    use Encode;
    my $c = decode "latin1", "\xe4";
    $c = uc $c;
    $c = chr (1 + ord $c);
    print encode "utf8", $c; ## <-- change here
    __END__
    0000000 303 205
    0000002
Furthermore on utf8 machines -CS should be enabled by default
I thought so too, but it ended up being a bad idea. Yes, it's great for UTF-8 encoded text files, but what if you're working with a binary? Instead of using binmode :raw on binaries, I chose to drop -C and use binmode :utf8 on UTF-8 text files, like the rest of the world.
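A rough sketch of that per-filehandle approach follows. It assumes a Unix-like system and uses a temporary file from File::Temp as a stand-in for a real text file. Note that it uses the stricter :encoding(UTF-8) layer in open rather than binmode :utf8; that is a swap made for the example, not what the post above describes.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical scratch file, standing in for a real UTF-8 text file.
my ($tmp_fh, $path) = tempfile(UNLINK => 1);
close $tmp_fh;

# Text file: the encoding layer goes on this handle only.
open my $out, '>:encoding(UTF-8)', $path or die "open: $!";
print {$out} "\x{00E4}\n";
close $out or die "close: $!";

# The same file treated as binary: no decoding, just bytes.
open my $raw, '<:raw', $path or die "open: $!";
my $bytes = do { local $/; <$raw> };
close $raw;

# And read back as text: decoded into characters.
open my $in, '<:encoding(UTF-8)', $path or die "open: $!";
my $text = do { local $/; <$in> };
close $in;

printf "%d bytes, %d characters\n", length $bytes, length $text;
```

Since no global -C switch is in effect, handles opened on binaries are left alone and only the handles you mark as text get encoded or decoded.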
And, if you've not noticed yet, there's no mention of use utf8 in this post (well, almost ;^)). AIUI, utf8 serves a totally different purpose, namely:
    use utf8;
    my $á = 42;
    print $á, "\n";
    __END__
    42
--
David Serrano
In reply to Re^2: bug in utf8 handling? by Hue-Bond, in thread "bug in utf8 handling?" by jethro