Re: UTF-8 strings and the bytes pragma

As shown, the strings are equivalent in UTF-8, but they become different once they are converted into bytes.

The strings are the same, but they are stored differently.

Just like the number one can be stored as

IV = "\x00\x00\x00\x01" [+1]
or
NV = "\x00\x00\x00\x00\x00\x00\xF0\x3F" [+1.0 * 2**(1023-1023)]

The string møøse can be stored as

PV = "\x6D\xF8\xF8\x73\x65\x00" [møøse]
or
PV,UTF8 = "\x6D\xC3\xB8\xC3\xB8\x73\x65\x00" [møøse]
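A quick way to see this for yourself is sketched below. The variable names are mine, and the string is written with \x escapes so the result doesn't depend on how the source file is encoded.

```perl
use strict;
use warnings;

# "møøse" as five characters, built from escapes.
my $down = "m\xF8\xF8se";
my $up   = $down;

utf8::downgrade($down);   # make sure this copy uses the UTF8=0 storage format
utf8::upgrade($up);       # make sure this copy uses the UTF8=1 storage format

my $same_string  = $down eq $up;                                  # true
my $same_length  = length($down) == length($up);                  # true, both 5
my $diff_storage = utf8::is_utf8($up) && !utf8::is_utf8($down);   # true

print "same string\n"       if $same_string;
print "same length\n"       if $same_length;
print "different storage\n" if $diff_storage;
```

Perl's string operators (eq, length, regex matching and so on) see the same five characters in both scalars. Only the internal flag reported by utf8::is_utf8 differs.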

Is this a bug?

The bug is the use of the bytes module. Never use the bytes module.
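To illustrate why, here is a minimal sketch (variable names are mine) in which bytes::length gives two different answers for the same string, because it reports the size of the internal buffer rather than anything about the string itself.

```perl
use strict;
use warnings;
use bytes ();   # loads the bytes:: functions without turning on the pragma

my $down = "m\xF8\xF8se";
my $up   = $down;
utf8::downgrade($down);   # UTF8=0 storage
utf8::upgrade($up);       # UTF8=1 storage

my $chars_down = length($down);          # 5
my $chars_up   = length($up);            # 5
my $bytes_down = bytes::length($down);   # 5
my $bytes_up   = bytes::length($up);     # 7, each ø takes two bytes internally

print "length:        $chars_down vs $chars_up\n";
print "bytes::length: $bytes_down vs $bytes_up\n";
```

Code that relies on the second pair of numbers behaves differently depending on how the scalar happens to be stored, which the caller normally has no reason to know or control.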

Any module that uses get_bytes or equivalent is buggy. If you want to convert the internal storage format of a scalar because you are dealing with a buggy module, you can use the built-ins utf8::upgrade and utf8::downgrade.

# Converts $s to use the UTF8=1 storage format if it's not already.
utf8::upgrade($s);

# Converts $s to use the UTF8=0 storage format if it's not already.
# Dies if it can't.
utf8::downgrade($s);

# Converts $s to use the UTF8=0 storage format if it's not already.
# Returns false if it can't.
utf8::downgrade($s, 1);
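As a rough illustration of the two failure modes of utf8::downgrade, the sketch below uses a character above 255, which cannot be stored in the UTF8=0 format. The variable names are mine.

```perl
use strict;
use warnings;

my $latin1 = "m\xF8\xF8se";   # every character is below 256
my $wide   = "\x{263A}";      # U+263A does not fit in a single byte

utf8::upgrade($latin1);

# With a true second argument, failure is reported through the return value.
my $ok_latin1 = utf8::downgrade($latin1, 1);   # true, now stored as UTF8=0
my $ok_wide   = utf8::downgrade($wide, 1);     # false, $wide is left as it was

# Without the second argument, failure is fatal, so trap it with eval.
my $died = !eval { utf8::downgrade($wide); 1 };

print "latin1 downgraded\n"       if $ok_latin1;
print "wide could not downgrade\n" if !$ok_wide;
print "one-argument form died\n"   if $died;
```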

But hopefully you will never have to do that. If instead you are trying to convert a Unicode string to UTF-8 or vice versa, you have a few options.

# From Unicode to UTF-8:
utf8::encode($s);
utf8::encode(my $utf8 = $uni);
use Encode qw( encode ); my $utf8 = encode('UTF-8', $uni);
use Encode qw( encode_utf8 ); my $utf8 = encode_utf8($uni);

# From UTF-8 to Unicode:
utf8::decode($s);
utf8::decode(my $uni = $utf8);
use Encode qw( decode ); my $uni = decode('UTF-8', $utf8);
use Encode qw( decode_utf8 ); my $uni = decode_utf8($utf8);
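For what it's worth, here is a small round trip using two of those options. It is only a sketch with variable names of my choosing, and it again builds the string from escapes.

```perl
use strict;
use warnings;
use Encode qw( encode decode );

my $uni = "m\xF8\xF8se";              # 5 characters

# encode() returns a new string and leaves $uni alone.
my $utf8 = encode('UTF-8', $uni);     # 7 bytes: "m\xC3\xB8\xC3\xB8se"

# utf8::encode() modifies its argument, so work on a copy.
my $in_place = $uni;
utf8::encode($in_place);

# Decoding the bytes gives back the original characters.
my $back = decode('UTF-8', $utf8);

print "encoded length: ", length($utf8), "\n";   # 7
print "round trip ok\n" if $back eq $uni;
```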

The utf8:: functions are built-in and work in place. The Encode:: functions have more flexible error handling.
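To make the difference in error handling concrete, the sketch below feeds both kinds of function a byte string that is not valid UTF-8. The variable names are mine, and FB_CROAK is just one of the CHECK values Encode offers.

```perl
use strict;
use warnings;
use Encode qw( decode );

my $good = "m\xC3\xB8\xC3\xB8se";   # valid UTF-8 for "møøse"
my $bad  = "m\xF8\xF8se";           # not valid UTF-8

# utf8::decode only tells you whether it worked.
my $copy = $bad;
my $builtin_ok = utf8::decode($copy);   # false, $copy is left unchanged

# decode() lets you choose what happens on malformed input.
# LEAVE_SRC keeps decode() from modifying its input string.
my $check = Encode::FB_CROAK() | Encode::LEAVE_SRC();

my $uni     = decode('UTF-8', $good, $check);          # "m\xF8\xF8se"
my $croaked = !eval { decode('UTF-8', $bad, $check); 1 };

print "utf8::decode reported failure\n" if !$builtin_ok;
print "decode with FB_CROAK died\n"     if $croaked;
```

Without a CHECK argument, decode substitutes U+FFFD for malformed sequences instead of dying, so pick whichever behaviour suits the data you are handling.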
