However, in my example, the contents of the string passed to encode_utf8 were not code points - they were already in UTF8, and were therefore left unaltered by encode_utf8.
No. The string passed to encode_utf8 did contain codepoints; that's what Perl strings are. And the function returned something that was different from the original string, as can be seen below:
use strict;
use warnings;
use Encode;
use charnames qw(greek);
for ("ABCD", "ABC\N{delta}", "\N{alpha}\N{beta}\N{gamma}\N{delta}") {
    printf "orig len=%d, enc len=%d\n",
        length($_),
        length(Encode::encode_utf8($_));
}
__END__
$ perl /tmp/p
orig len=4, enc len=4
orig len=4, enc len=5
orig len=4, enc len=8
How the string is internally represented in Perl is (almost always) completely irrelevant. Perl sees strings as a list of codepoints; typically if all the codepoints are < 256, perl stores them using one byte per codepoint; if any are >= 256, it stores them all as a variable number of bytes using (as it happens) utf8 encoding internally.
Encode::encode_utf8() returns a string consisting of one codepoint for each of the octets of what would be the utf8 representation of the original string, regardless of how that original string was actually stored internally.
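A minimal sketch of that last point, assuming only the core utf8::upgrade() and utf8::downgrade() functions to force the two internal storage formats (the variable names are just for illustration):

```perl
use strict;
use warnings;
use Encode;

# The same four codepoints (the last is e-acute, U+00E9), stored two ways.
my $bytes = "caf\x{e9}";
my $wide  = "caf\x{e9}";
utf8::downgrade($bytes);   # force one byte per codepoint internally
utf8::upgrade($wide);      # force utf8 storage internally

# Perl-level semantics do not depend on the internal storage.
printf "same string: %s\n", $bytes eq $wide ? "yes" : "no";     # yes
printf "lengths: %d and %d\n", length($bytes), length($wide);   # 4 and 4

# encode_utf8 yields the same octets for both copies.
printf "same encoding: %s\n",
    Encode::encode_utf8($bytes) eq Encode::encode_utf8($wide) ? "yes" : "no";   # yes
printf "enc len=%d\n", length(Encode::encode_utf8($wide));      # 5
```

The two copies differ only in how perl happens to store them, yet they compare equal, have the same length, and encode to the same five octets.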
Dave.
>No. The string passed to encode_utf8 did contain codepoints; that's what Perl strings are.
Hmm, this doesn't make sense to me: AFAIK Perl strings never store code points, but rather store the UTF-8 encoding of the code points. For example, the string containing a Greek uppercase Kappa, whose code point is 039A:
$str = "\x{039A}";
does not contain, in hex, 039A, but rather in hex, CE9A,
the UTF8 encoding of that code point.
>And the function returned something that was different from the original string, as can be seen below:
What your example seems to demonstrate, AFAICS, is the character vs. byte output of length() when presented with strings where the UTF-8 flag is switched on or off.
So for the final string, containing alpha, beta, gamma, and delta, it has a length of 4 characters, when Perl knows that it contains valid UTF-8, but a length of 8 when Perl is assuming the old byte=character semantics. However, both
the strings are byte-for-byte identical.
Or, if I'm wrong here, I'm very confused.
Steve Collyer
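sub encode_utf8 {
    my $e = '';
    # Walk the string one codepoint at a time.
    for (map ord, split //, $_[0]) {
        if ($_ < 0x80) {
            # U+0000..U+007F: one octet
            $e .= chr($_);
        }
        elsif ($_ < 0x800) {
            # U+0080..U+07FF: two octets (5 + 6 payload bits)
            $e .= chr(0xC0 + ($_ >> 6));
            $e .= chr(0x80 + ($_ & 63));
        }
        else {
            # 3- and 4-octet sequences elided; they follow the same pattern
            ...;
        }
    }
    return $e;
}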
Dave.
>If so, I guess that decode_utf8 should eat UTF8 encoded data, and spit out Unicode code points?
I'll answer this myself: No, if you give decode_utf8 a string
containing octets that are already valid UTF-8, it should return a string byte-for-byte identical but with the UTF-8 flag switched on.
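A small check of this, as a sketch only: the Kappa octets from earlier in the thread are used as input, and the variable names are made up for the example. It shows what length() and ord() report before and after decoding:

```perl
use strict;
use warnings;
use Encode;

my $octets = "\xCE\x9A";                     # UTF-8 octets for U+039A
my $chars  = Encode::decode_utf8($octets);   # decoded copy

# Before decoding: two elements, the first being 0xCE.
printf "octets: len=%d, first ord=0x%X\n", length($octets), ord($octets);

# After decoding: one element, 0x39A.
printf "chars:  len=%d, first ord=0x%X\n", length($chars), ord($chars);

# The decoded string carries the UTF-8 flag.
printf "utf8 flag: %s\n", utf8::is_utf8($chars) ? "on" : "off";
```

So whatever the internal bytes look like, at the Perl level the decoded string is one element with value 0x39A, where the input was two elements, 0xCE and 0x9A.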
Steve Collyer