Re: utf8 encoding and warnings woe

Perl has two kinds of strings: strings of bytes and strings of characters. When you use use utf8, string literals are decoded from a string of UTF-8 bytes into a string of characters.

use Encode qw( decode );

$s = "&#39640;";
print(length($s), "\n");  # 3

$s = decode('utf8', "&#39640;");
print(length($s), "\n");  # 1

$s = do { use utf8; "&#39640;" };
print(length($s), "\n");  # 1

→ Replace &#39640; with the UTF-8 encoding of 高 in the above source.

Writing to a ':utf8' handle reverses the process.

use Encode qw( encode );

my $s = do { use utf8; "&#39640;" };
print(length($s), "\n");   # 1

my $s2;
open my $fh, '>:utf8', \$s2 or die $!;
print $fh $s;
close $fh;                 # make sure everything has reached $s2
print(length($s2), "\n");  # 3

$s2 = encode('utf8', $s);
print(length($s2), "\n");  # 3

→ Replace &#39640; with the UTF-8 encoding of 高 in the above source.
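As a quick check of the round trip, here is a minimal sketch that sidesteps the source-encoding question by writing the character as an escape ("\x{9AD8}" is 高). The variable names are mine, not from your code.

```perl
use strict;
use warnings;
use Encode qw( encode decode );

# "\x{9AD8}" is the character 高, written as an escape so the
# encoding of this source file doesn't matter.
my $chars_orig = "\x{9AD8}";

# Characters -> UTF-8 bytes.
my $bytes_enc = encode('UTF-8', $chars_orig);

# UTF-8 bytes -> characters again.
my $chars_back = decode('UTF-8', $bytes_enc);

print(length($chars_orig), "\n");   # 1
print(length($bytes_enc), "\n");    # 3
print($chars_back eq $chars_orig ? "same\n" : "different\n");   # same
```

The escape form is handy in test scripts, since it behaves the same with or without use utf8.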

In the context of your code, that means $str contains 6 characters, while $output2 contains 18 bytes (each of those characters takes 3 bytes in UTF-8). Since the two hold very different contents, it's to be expected that they behave differently.

The warning ("Wide character in print") is given when a string of characters is output in :bytes mode (the default, and the opposite of :utf8). The gibberish comes from encoding a string twice. One solution:
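To see both effects in isolation, here is a small sketch. It is not your code: the variable names are made up, and an in-memory handle stands in for STDOUT so the warning can be captured rather than printed.

```perl
use strict;
use warnings;
use Encode qw( encode );

my $chars = "\x{9AD8}";   # 高, one character

# Encoding once gives the 3 bytes you want in the output.
my $bytes_once = encode('UTF-8', $chars);

# Encoding again treats each of those 3 bytes as a character
# and encodes it separately. This is the gibberish.
my $bytes_twice = encode('UTF-8', $bytes_once);

print(length($bytes_once), "\n");    # 3
print(length($bytes_twice), "\n");   # 6

# Printing characters to a handle that has no :utf8 layer
# triggers the warning. Collect it instead of letting it print.
my @warnings;
{
    local $SIG{__WARN__} = sub { push @warnings, $_[0] };
    open my $fh, '>', \my $bytes_out or die $!;
    print $fh $chars;
    close $fh;
}
print(@warnings);   # whatever was collected above
```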

use strict;
use warnings;
use utf8;

use Encode qw( decode );

my $chars_src = "&#37319;&#26679;&#36895;&#29575;&#22826;&#39640;";
my $bytes_dst;

{
   open my $fh, '>:utf8', \$bytes_dst;
   print $fh $chars_src;
}

my $chars_dst = decode('utf8', $bytes_dst);

binmode(STDOUT, ':utf8');
print("$chars_src\n");
print("$chars_dst\n");

→ Replace &#37319;&#26679;&#36895;&#29575;&#22826;&#39640; with the UTF-8 encoding of 采样速率太高 in the above source.

Another solution (this one also needs encode imported from Encode):

...

my $bytes_src = encode('utf8', $chars_src);

print("$bytes_src\n");
print("$bytes_dst\n");

Another solution:

...

binmode(STDOUT, ':utf8');
print("$chars_src\n");
binmode(STDOUT, ':bytes');
print("$bytes_dst\n");

I like using Hungarian notation (bytes_ vs chars_) when dealing with encoded and unencoded values. Joel Spolsky's "Making Wrong Code Look Wrong" is a good essay on the subject.
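A rough sketch of how that naming can look in practice. The two helper subs are hypothetical wrappers of my own around Encode, not anything standard.

```perl
use strict;
use warnings;
use Encode qw( encode decode );

# Hypothetical helpers; the names spell out the direction.
sub chars_from_bytes { my ($bytes) = @_; return decode('UTF-8', $bytes) }
sub bytes_from_chars { my ($chars) = @_; return encode('UTF-8', $chars) }

my $bytes_in  = "\xE9\xAB\x98";              # UTF-8 encoding of 高
my $chars_in  = chars_from_bytes($bytes_in); # decode at the input boundary
my $chars_msg = $chars_in . "!";             # work on characters
my $bytes_out = bytes_from_chars($chars_msg);# encode at the output boundary

# A line like this looks wrong at a glance, because it mixes prefixes:
# my $oops = $bytes_in . $chars_msg;

print(length($chars_msg), "\n");   # 2
print(length($bytes_out), "\n");   # 4
```

The point is only that bytes_ and chars_ should never meet in one expression without a decode or encode between them.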

Update: Added extra solutions.
Update: Added "reverses the process" bit.
