in reply to What does Encode::encode_utf8 do to UTF-8 data ?

>AFAICS it merely removes the UTF-8 flag, as the program below seems to demonstrate

Yes, and that's what it's supposed to do.
use strict;
use warnings;
use Encode;

$a = "\x{100}";
$b = Encode::encode_utf8($a);
print "a = ", join(',', map ord($_), split //, $a), "\n";
print "b = ", join(',', map ord($_), split //, $b), "\n";
__END__

$ perl /tmp/p
a = 256
b = 196,128
$

$a is a string containing 1 character, which happens to have a utf8 representation that takes two octets; $b is a string containing 2 characters, each of which represents one of the octets of $a's utf8 representation.

PS your if statement has the wrong logic; it prints "differ" when eq matches.

Dave.

Re^2: What does Encode::encode_utf8 do to UTF-8 data ?
by scollyer (Sexton) on Oct 03, 2005 at 12:32 UTC
    >$a is a string containing 1 character, which happens to have
    >a utf8 representation that takes two octets; $b is a string
    >containing 2 characters, which represent each of the octets
    >of $a's utf8 representation.

    OK, I think I may understand. In your example, encode_utf8 seems to be eating a single Unicode code point and spitting out its UTF-8 encoding.

    However, in my example, the contents of the string passed to encode_utf8 were not code points - they were already in UTF-8, and were therefore left unaltered by encode_utf8.

    Is this right ? If so, I guess that decode_utf8 should eat UTF8 encoded data, and spit out Unicode code points ?

    Steve Collyer

    PS: Oops !

      >However, in my example, the contents of the string passed to encode_utf8 were not code points - they were already in UTF-8, and were therefore left unaltered by encode_utf8.

      No. The string passed to encode_utf8 did contain codepoints; that's what Perl strings are. And the function returned something that was different from the original string, as can be seen below:
      use strict;
      use warnings;
      use Encode;
      use charnames qw(greek);

      for ("ABCD", "ABC\N{delta}", "\N{alpha}\N{beta}\N{gamma}\N{delta}") {
          printf "orig len=%d, enc len=%d\n",
              length($_), length(Encode::encode_utf8($_));
      }
      __END__

      $ perl /tmp/p
      orig len=4, enc len=4
      orig len=4, enc len=5
      orig len=4, enc len=8

      How the string is internally represented in Perl is (almost always) completely irrelevant. Perl sees strings as a list of codepoints; typically if all the codepoints are < 256, perl stores them using one byte per codepoint; if any are >= 256, it stores them all as a variable number of bytes using (as it happens) utf8 encoding internally.

      Regardless of a string's internal coding, Encode::encode_utf8() returns a string containing one codepoint for each of the octets in what would be the utf8 representation of the original string.
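
      A minimal sketch of that point, assuming only the core utf8:: functions and Encode. The variable names ($plain, $upgraded and so on) are made up for illustration. It stores the same one-character string (U+00E9) in both internal formats and encodes each copy:

      ```perl
      use strict;
      use warnings;
      use Encode ();

      # Two copies of the same one-character string (U+00E9).
      my $plain    = "\xE9";
      my $upgraded = "\xE9";
      utf8::upgrade($upgraded);    # force the internal utf8 storage format

      # Same characters as far as Perl is concerned, different internal storage.
      my $same_chars = ($plain eq $upgraded)    ? 1 : 0;
      my $flag_plain = utf8::is_utf8($plain)    ? 1 : 0;
      my $flag_up    = utf8::is_utf8($upgraded) ? 1 : 0;

      # Encode both copies and list the resulting octets.
      my $enc_plain = Encode::encode_utf8($plain);
      my $enc_up    = Encode::encode_utf8($upgraded);

      printf "same chars=%d, flags=%d/%d\n", $same_chars, $flag_plain, $flag_up;
      printf "enc plain    = %s\n", join(',', map ord($_), split //, $enc_plain);
      printf "enc upgraded = %s\n", join(',', map ord($_), split //, $enc_up);
      ```

      Both encoded strings should come out as the two octets 195,169, even though only one of the inputs had the UTF-8 flag set.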

      Dave.

        >No. The string passed to encode_utf8 did contain codepoints; that's what Perl strings are.

        Hmm, this doesn't make sense to me: AFAIK Perl strings never store code points, but rather store the UTF-8 encoding of the code points e.g. the string with a Greek uppercase Kappa, whose code point is 039A:

        $str = "\x{039A}";
        does not contain, in hex, 039A, but rather in hex, CE9A, the UTF8 encoding of that code point.

        >And the function returned something that was different from the original string, as can be seen below:

        What your example seems to demonstrate, AFAICS, is the character vs. byte output of length, when presented with strings where the UTF-8 flag is switched on or off.

        So for the final string, containing alpha, beta, gamma, and delta, it has a length of 4 characters, when Perl knows that it contains valid UTF-8, but a length of 8 when Perl is assuming the old byte=character semantics. However, both the strings are byte-for-byte identical.
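
        One way to probe this, as a rough sketch: compare the character length with the size of the internal buffer for both strings, then ask Perl whether the two strings are equal. It assumes bytes::length from the core bytes module as a debugging aid; $orig, $enc and $equal are illustrative names only.

        ```perl
        use strict;
        use warnings;
        use Encode ();
        use bytes ();    # load bytes::length without enabling byte semantics

        my $orig = "\x{3b1}\x{3b2}\x{3b3}\x{3b4}";    # alpha, beta, gamma, delta
        my $enc  = Encode::encode_utf8($orig);

        # length() counts characters; bytes::length() reports the internal buffer size.
        printf "orig: chars=%d, internal bytes=%d\n", length($orig), bytes::length($orig);
        printf "enc:  chars=%d, internal bytes=%d\n", length($enc),  bytes::length($enc);

        # String comparison works on characters, not on the internal buffers.
        my $equal = ($orig eq $enc) ? 1 : 0;
        print "orig eq enc? ", ($equal ? "yes" : "no"), "\n";
        ```

        The internal byte counts should match (8 and 8) while the character counts differ (4 and 8), and eq should report "no": at the Perl level the two are different strings.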

        Or, if I'm wrong here, I'm very confused.

        Steve Collyer

      >If so, I guess that decode_utf8 should eat UTF8 encoded data, and spit out Unicode code points ?

      I'll answer this myself: No, if you give decode_utf8 a string containing octets that are already valid UTF-8, it should return a string byte-for-byte identical but with the UTF-8 flag switched on.
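
      A small sketch to check this, assuming Encode::decode_utf8 is called on a plain octet string with no CHECK argument (the names $octets, $chars, $flag and $round_trip are illustrative). It decodes the two octets CE 9A from the Kappa example and looks at the result from the Perl side:

      ```perl
      use strict;
      use warnings;
      use Encode ();

      my $octets = "\xCE\x9A";                      # UTF-8 encoding of U+039A
      my $chars  = Encode::decode_utf8($octets);

      # Show length and the ordinals (in hex) that Perl sees in each string.
      printf "octets: len=%d, ords=%s\n", length($octets),
          join(',', map sprintf('%X', ord($_)), split //, $octets);
      printf "chars:  len=%d, ords=%s\n", length($chars),
          join(',', map sprintf('%X', ord($_)), split //, $chars);

      # The decoded string carries the UTF-8 flag, and encoding it again
      # yields the original octets.
      my $flag       = utf8::is_utf8($chars) ? 1 : 0;
      my $round_trip = (Encode::encode_utf8($chars) eq $octets) ? 1 : 0;
      printf "flag=%d, round trip ok=%d\n", $flag, $round_trip;
      ```

      The decoded string should report len=1 with ordinal 39A, whatever its internal bytes look like, and re-encoding it gives back the two octets CE,9A.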

      Steve Collyer