scollyer has asked for the wisdom of the Perl Monks concerning the following question:

Can anyone explain what Encode::encode_utf8 does when applied to a string with the UTF-8 flag switched on ?

AFAICS it merely removes the UTF-8 flag, as the program below seems to demonstrate.

The Encode documentation says that encode(ENCODING, ..) "Encodes a string from Perl's internal form into ENCODING and returns a sequence of octets". Now, my understanding is that Perl's internal encoding *is* UTF-8, so that when applied to a string with the UTF-8 flag on, Encode::encode_utf8 is essentially a no-op, and merely switches off the UTF-8 flag.

Am I confused ?

Steve Collyer

################# code follows #######################

#!/usr/bin/perl
use strict;
use warnings;
use Encode;
use charnames qw(greek);

binmode(STDOUT, ":utf8");

my $utf8_data = "<\N{alpha}\N{beta}\N{gamma}\N{delta}>";
print $utf8_data, "\n\n";

my $enc_utf8_data = Encode::encode_utf8($utf8_data);

print Encode::is_utf8($enc_utf8_data)
    ? "\$enc_utf8_data marked as UTF-8\n\n"
    : "\$enc_utf8_data not marked as UTF-8\n\n";

print Encode::is_utf8($utf8_data)
    ? "\$utf8_data marked as UTF-8\n\n"
    : "\$utf8_data not marked as UTF-8\n\n";

if ($utf8_data eq $enc_utf8_data) {
    print "strings differ\n";
    print "utf8_data ", unpack("H*", $utf8_data), "\n";
    print "enc_utf8_data ", unpack("H*", $enc_utf8_data), "\n";
}
else {
    print "strings are the same\n";
    print "utf8_data ", unpack("H*", $utf8_data), "\n";
    print "enc_utf8_data ", unpack("H*", $enc_utf8_data), "\n";
}

Re: What does Encode::encode_utf8 do to UTF-8 data ?
by dave_the_m (Monsignor) on Oct 03, 2005 at 11:15 UTC
    AFAICS it merely removes the UTF-8 flag, as the program below seems to demonstrate
    Yes, and that's what it's supposed to do.
    use strict;
    use warnings;
    use Encode;

    $a = "\x{100}";
    $b = Encode::encode_utf8($a);
    print "a = ", join(',', map ord($_), split //, $a), "\n";
    print "b = ", join(',', map ord($_), split //, $b), "\n";
    __END__
    $ perl /tmp/p
    a = 256
    b = 196,128
    $
    $a is a string containing 1 character, which happens to have a utf8 representation that takes two octets; $b is a string containing 2 characters, which represent each of the octets of $a's utf8 representation.

    PS your if statement has the wrong logic; it prints "differ" when eq matches.

    Dave.
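
As a hedged illustration of both points above (the swapped branches and the one-character-to-several-octets expansion), here is a minimal sketch of the comparison with the branches in the intended order. The variable names `$chars`, `$octets` and `$verdict` are made up for this example, and the Greek letters are written as numeric code points so that no charnames import is needed.

```perl
use strict;
use warnings;
use Encode ();

# "<", four Greek letters (alpha..delta) given as code points, then ">".
my $chars  = "<\x{3b1}\x{3b2}\x{3b3}\x{3b4}>";
my $octets = Encode::encode_utf8($chars);

# Branches in the intended order: eq matching means "same".
my $verdict = $chars eq $octets ? "same" : "differ";

printf "strings %s\n", $verdict;
printf "chars:  length %d, flag %s\n",
    length($chars), utf8::is_utf8($chars) ? "on" : "off";
printf "octets: length %d, flag %s\n",
    length($octets), utf8::is_utf8($octets) ? "on" : "off";
```

Run as written, this should report that the strings differ, with lengths 6 and 10: each Greek letter becomes two characters in the encoded string, so the flag is not the only thing that changes as far as Perl-level string operations are concerned.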

      >$a is a string containing 1 character, which happens to have
      >a utf8 representation that takes two octets; $b is a string
      >containing 2 characters, which represent each of the octets
      >of $a's utf8 representation.

      OK, I think I may understand. In your example, encode_utf8 seems to be eating a single Unicode code point and spitting out its UTF-8 encoding.

      However, in my example, the contents of the string passed to encode_utf8 were not code points - they were already in UTF-8, and were therefore left unaltered by encode_utf8.

      Is this right ? If so, I guess that decode_utf8 should eat UTF8 encoded data, and spit out Unicode code points ?

      Steve Collyer

      PS: Oops !

        However, in my example, the contents of the string passed to encode_utf8 were not code points - they were already in UTF-8, and were therefore left unaltered by encode_utf8.
        No. The string passed to encode_utf8 did contain codepoints; that's what Perl strings are. And the function returned something that was different from the original string, as can be seen below:
        use strict;
        use warnings;
        use Encode;
        use charnames qw(greek);

        for ("ABCD", "ABC\N{delta}", "\N{alpha}\N{beta}\N{gamma}\N{delta}") {
            printf "orig len=%d, enc len=%d\n",
                length($_), length(Encode::encode_utf8($_));
        }
        __END__
        $ perl /tmp/p
        orig len=4, enc len=4
        orig len=4, enc len=5
        orig len=4, enc len=8
        How the string is internally represented in Perl is (almost always) completely irrelevant. Perl sees strings as a list of codepoints; typically if all the codepoints are < 256, perl stores them using one byte per codepoint; if any are >= 256, it stores them all as a variable number of bytes using (as it happens) utf8 encoding internally.

        Encode::encode_utf8() returns a string containing one codepoint for each of the octets of what would be the utf8 representation of the original string, regardless of how that original string was actually stored internally.

        Dave.
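
One way to see that the internal storage does not matter is to take a character below 256 and hold it in both internal forms. The sketch below is only an illustration under that assumption: the names `$narrow` and `$wide` are invented here, and `utf8::upgrade` is used purely to force the UTF-8 internal form on one copy.

```perl
use strict;
use warnings;
use Encode ();

# U+00E9 fits in one byte, so Perl can store it in either internal form.
my $narrow = "\x{e9}";    # stored one byte per codepoint, flag off
my $wide   = "\x{e9}";
utf8::upgrade($wide);     # same codepoint, UTF-8 internal form, flag on

my $same_string = $narrow eq $wide;    # true: identical codepoints
my $enc_narrow  = Encode::encode_utf8($narrow);
my $enc_wide    = Encode::encode_utf8($wide);

printf "flags: narrow=%d wide=%d\n",
    utf8::is_utf8($narrow) ? 1 : 0, utf8::is_utf8($wide) ? 1 : 0;
printf "equal before encoding: %s\n", $same_string ? "yes" : "no";
printf "encoded narrow: %s\n", unpack("H*", $enc_narrow);
printf "encoded wide:   %s\n", unpack("H*", $enc_wide);
```

Both encoded strings should come out as the two octets c3 a9. For `$narrow` the flag was off to begin with, so in that case encode_utf8 clearly did more than clear a flag.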

        >If so, I guess that decode_utf8 should eat UTF8 encoded data, and spit out Unicode code points ?

        I'll answer this myself: No, if you give decode_utf8 a string containing octets that are already valid UTF-8, it should return a string byte-for-byte identical but with the UTF-8 flag switched on.

        Steve Collyer
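
For what it is worth, the two descriptions of decode_utf8 in this sub-thread can be compared with a small round trip. This is a sketch only, with made-up variable names; it looks at the strings the way Perl code sees them (through `length` and `eq`) rather than at the internal buffer.

```perl
use strict;
use warnings;
use Encode ();

my $chars  = "\x{3b1}\x{3b2}\x{3b3}\x{3b4}";    # 4 characters
my $octets = Encode::encode_utf8($chars);       # 8 characters, each < 256
my $back   = Encode::decode_utf8($octets);      # 4 characters again

printf "chars=%d octets=%d back=%d\n",
    length($chars), length($octets), length($back);
printf "round trip %s\n", $back eq $chars ? "ok" : "failed";
```

The lengths should come out as 4, 8 and 4. So even if the underlying bytes stay the same, at the level of Perl string operations decode_utf8 does turn eight octet-valued characters back into four code points.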

Re: What does Encode::encode_utf8 do to UTF-8 data ?
by tphyahoo (Vicar) on Oct 03, 2005 at 10:34 UTC
    Well, I can't answer your question directly, but I will say that the difficult thing for me with utf-8 was wading through the documentation. A web site that helped me cut to the good stuff on utf-8 and perl was Unicode-processing issues in Perl and how to cope with it, which mentions utf8_encode in a couple of the examples.

    UPDATE: fixed link