in reply to UTF-8 strings and the bytes pragma
As shown, the strings are equivalent in UTF-8, but they become different once they are converted into bytes.
The strings are the same, but they are stored differently.
Just like the number one can be stored as
IV = "\x00\x00\x00\x01" [+1]
or
NV = "\x00\x00\x00\x00\x00\x00\xF0\x3F" [+1.0 * 2**(1023-1023)]
The string møøse can be stored as
PV = "\x6D\xF8\xF8\x73\x65\x00" [møøse]
or
PV,UTF8 = "\x6D\xC3\xB8\xC3\xB8\x73\x65\x00" [møøse]
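A minimal sketch of the point above, assuming a stock Perl 5 with no `use utf8` in effect: the same five-character string is held once in each storage format, and the two copies still compare equal. The variable names are made up for illustration, and `utf8::is_utf8` is used only to peek at the internal flag, not as something ordinary code should branch on.

```perl
use strict;
use warnings;

# Two copies of the same five-character string.
my $down = "m\xF8\xF8se";   # literal with \xF8 escapes: UTF8=0 storage
my $up   = $down;
utf8::upgrade($up);         # same string, now in UTF8=1 storage

# The string value is unaffected by the storage format.
my $same_string = $down eq $up                 ? 1 : 0;
my $same_length = length($down) == length($up) ? 1 : 0;

# Only the internal flag differs.
my $down_flag = utf8::is_utf8($down) ? 1 : 0;
my $up_flag   = utf8::is_utf8($up)   ? 1 : 0;

print "equal: $same_string, same length: $same_length\n";
print "UTF8 flag: down=$down_flag, up=$up_flag\n";
```

To see the actual buffers, `Devel::Peek::Dump($down)` and `Devel::Peek::Dump($up)` print the PV of each scalar.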
Is this a bug?
The bug is the use of the bytes module. Never use the bytes module.
Any module that uses get_bytes or equivalent is buggy. If you want to convert the internal storage format of a scalar because you are dealing with a buggy module, you can use the built-ins utf8::upgrade and utf8::downgrade.
    # Converts $s to use the UTF8=1 storage format if it's not already.
    utf8::upgrade($s);

    # Converts $s to use the UTF8=0 storage format if it's not already.
    # Dies if it can't.
    utf8::downgrade($s);

    # Converts $s to use the UTF8=0 storage format if it's not already.
    # Returns false if it can't.
    utf8::downgrade($s, 1);
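As a rough illustration of the two failure modes of utf8::downgrade, here is a sketch using a string that contains a character above 0xFF and therefore cannot be stored in the UTF8=0 format. The variable names are hypothetical.

```perl
use strict;
use warnings;

# Contains U+2660, which has no single-byte representation.
my $wide = "caf\xE9 \x{2660}";

# With a true second argument, failure is reported by a false return value.
my $wide_ok = utf8::downgrade($wide, 1) ? 1 : 0;

# Without it, failure is fatal, so trap it with eval.
my $died = eval { utf8::downgrade($wide); 1 } ? 0 : 1;

# A string whose characters are all below 0x100 can always be downgraded.
my $narrow = "caf\xE9";
utf8::upgrade($narrow);
my $narrow_ok = utf8::downgrade($narrow, 1) ? 1 : 0;

print "wide: ok=$wide_ok died=$died; narrow: ok=$narrow_ok\n";
```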
But you hopefully never have to do that. If instead you are trying to convert a Unicode string to UTF-8 or vice-versa, you have a few options.
    # From Unicode to UTF-8:

    utf8::encode($s);

    utf8::encode(my $utf8 = $uni);

    use Encode qw( encode );
    my $utf8 = encode('UTF-8', $uni);

    use Encode qw( encode_utf8 );
    my $utf8 = encode_utf8($uni);

    # From UTF-8 to Unicode:

    utf8::decode($s);

    utf8::decode(my $uni = $utf8);

    use Encode qw( decode );
    my $uni = decode('UTF-8', $utf8);

    use Encode qw( decode_utf8 );
    my $uni = decode_utf8($utf8);
The utf8:: functions are built-in and work in-place. The Encode:: functions have more flexible error handling.
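To make the difference between the two kinds of conversion concrete, here is a small round trip with the core Encode module. It is only a sketch: the variable names are invented, and `Encode::FB_CROAK` is just one of the check modes Encode offers for malformed input.

```perl
use strict;
use warnings;
use Encode qw( encode decode );

my $uni  = "m\xF8\xF8se";            # 5 characters
my $utf8 = encode('UTF-8', $uni);    # 7 bytes

my $uni_len  = length($uni);
my $utf8_len = length($utf8);

# Hex dump of the encoded bytes.
my $hex = join ' ', map { sprintf '%02X', ord } split //, $utf8;

# Decoding gives back the original string.
my $back       = decode('UTF-8', $utf8);
my $round_trip = $back eq $uni ? 1 : 0;

# With FB_CROAK, malformed UTF-8 is a fatal error rather than
# being silently replaced.
my $bad     = "\xF8";
my $croaked = eval { decode('UTF-8', $bad, Encode::FB_CROAK()); 1 } ? 0 : 1;

print "$uni_len chars -> $utf8_len bytes: $hex\n";
print "round trip ok: $round_trip, malformed input croaked: $croaked\n";
```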