UTF-8 strings and the bytes pragma

trizen has asked for the wisdom of the Perl Monks concerning the following question:

Hello Perl Monks,

I have a question about some very strange behavior involving the bytes pragma and two UTF-8 strings.

Below is the code that illustrates the problem:

use utf8;
use 5.010;
use strict;
use warnings;

binmode(STDOUT, ':encoding(UTF-8)');

sub get_bytes {
    my ($string) = @_;

    use bytes;
    map { bytes::ord bytes::substr($string, $_, 1) } 0 .. bytes::length($string) - 1;
}

my $s1 = "møøse";
my $s2 = "m\xF8\xF8se";

say $s1;
say $s2;

say "Equal: ", $s1 eq $s2;

say join(" ", get_bytes($s1));
say join(" ", get_bytes($s2));

The output (with perl-5.22.0):

møøse
møøse
Equal: 1
109 195 184 195 184 115 101
109 248 248 115 101

As shown, the strings are equivalent in UTF-8, but they become different once they are converted into bytes.

Is this a bug, or am I missing something important in this conversion? Thanks!

Update: by replacing "m\xF8\xF8se" with decode_utf8(encode_utf8("m\xF8\xF8se")) it seems to work as expected.


Re: UTF-8 strings and the bytes pragma by choroba (Cardinal) on Jun 19, 2015 at 15:30 UTC
They are different. I used Devel::Peek to check:

use Devel::Peek;
# ...
Dump $s1;
Dump $s2;
__END__
SV = PV(0x60003a520) at 0x600078300
  REFCNT = 1
  FLAGS = (PADMY,POK,pPOK,UTF8)
  PV = 0x600069210 "m\303\270\303\270se"\0 [UTF8 "m\x{f8}\x{f8}se"]
  CUR = 7
  LEN = 16
SV = PV(0x60003a470) at 0x6000e2960
  REFCNT = 1
  FLAGS = (PADMY,POK,pPOK)
  PV = 0x60016e4d0 "m\370\370se"\0
  CUR = 5
  LEN = 16

Another way to check the difference is to use `is_utf8()` from Encode. Read about the UTF8 flag in the documentation.

use Encode qw{ is_utf8 };
# ...
say 'Same UTF8 flag: ', is_utf8($s1) == is_utf8($s2);

لսႽ† ᥲᥒ⚪⟊Ⴙᘓᖇ Ꮅᘓᖇ⎱ Ⴙᥲ𝇋ƙᘓᖇ
Re: UTF-8 strings and the bytes pragma by ikegami (Patriarch) on Jun 19, 2015 at 18:44 UTC
"As shown, the strings are equivalent in UTF-8, but they become different once they are converted into bytes"

The strings are the same, but they are stored differently. Just like the number one can be stored as

IV = "\x00\x00\x00\x01" [+1]

or

NV = "\x00\x00\x00\x00\x00\x00\xF0\x3F" [+1.0 * 2**(1023-1023)]

the string møøse can be stored as

PV = "\x6D\xF8\xF8\x73\x65\x00" [møøse]

or

PV,UTF8 = "\x6D\xC3\xB8\xC3\xB8\x73\x65\x00" [møøse]

"Is this a bug?"

The bug is the use of the bytes module. Never use the bytes module. Any module that uses `get_bytes` or equivalent is buggy.

If you want to convert the internal storage format of a scalar because you are dealing with a buggy module, you can use the builtins `utf8::upgrade` and `utf8::downgrade`.

# Converts $s to use the UTF8=1 storage format if it's not already.
utf8::upgrade($s);

# Converts $s to use the UTF8=0 storage format if it's not already.
# Dies if it can't.
utf8::downgrade($s);

# Converts $s to use the UTF8=0 storage format if it's not already.
# Returns false if it can't.
utf8::downgrade($s, 1);

But you hopefully never have to do that.

If instead you are trying to convert a Unicode string to UTF-8 or vice versa, you have a few options.

# From Unicode to UTF-8:
utf8::encode($s);
utf8::encode(my $utf8 = $uni);
use Encode qw( encode ); my $utf8 = encode('UTF-8', $uni);
use Encode qw( encode_utf8 ); my $utf8 = encode_utf8($uni);

# From UTF-8 to Unicode:
utf8::decode($s);
utf8::decode(my $uni = $utf8);
use Encode qw( decode ); my $uni = decode('UTF-8', $utf8);
use Encode qw( decode_utf8 ); my $uni = decode_utf8($utf8);

The utf8:: functions are built-in and work in-place. The Encode:: functions have more flexible error handling.
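As a rough illustration of the advice above, here is a minimal sketch of how the original `get_bytes` could be replaced by an explicit encoding step. The helper name `utf8_octets` is made up for this example and is not from the thread; the idea is simply to encode with Encode's `encode` and then list the octets, so that the result should no longer depend on how Perl happens to store the string internally.

```perl
use strict;
use warnings;
use Encode qw( encode );

# Hypothetical helper (not from the thread): encode the string explicitly
# as UTF-8, then return the resulting octets as a list of numbers.
sub utf8_octets {
    my ($string) = @_;
    return unpack 'C*', encode('UTF-8', $string);
}

my $s1 = "m\xF8\xF8se";   # stored with UTF8=0
my $s2 = $s1;
utf8::upgrade($s2);       # same string, now stored with UTF8=1

my $octets1 = join ' ', utf8_octets($s1);
my $octets2 = join ' ', utf8_octets($s2);

print "$octets1\n";
print "$octets2\n";
```

Both lines should come out identical, because the octets are produced from the string's code points rather than read from its internal buffer.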
Re: UTF-8 strings and the bytes pragma by Anonymous Monk on Jun 19, 2015 at 17:14 UTC
For strings, Perl uses two encodings internally: one is like UTF-8 (with some differences), the other one isn't. That's supposed to be completely transparent for the programmer. `use utf8` just tells the compiler that the source code is in UTF-8. It doesn't say anything about how the strings are going to be stored internally.
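To make the "transparent" part concrete, here is a small sketch (variable names are illustrative only). It takes one string, keeps one copy in each storage format using `utf8::upgrade`, and compares them with ordinary string operations. Assuming no `bytes` pragma or similar is involved, the two copies should be indistinguishable apart from the internal flag.

```perl
use strict;
use warnings;

my $down = "m\xF8\xF8se";   # literal with code points below 0x100
my $up   = $down;
utf8::upgrade($up);         # changes only the internal storage format

# The internal flag differs between the two copies...
my $flags_differ = (utf8::is_utf8($down) xor utf8::is_utf8($up)) ? 1 : 0;

# ...but string-level operations see the same string.
my $same_string = ($down eq $up) ? 1 : 0;
my $same_length = (length($down) == length($up)) ? 1 : 0;
my $same_ords   = (join(' ', map { ord } split //, $down)
                eq join(' ', map { ord } split //, $up)) ? 1 : 0;

print "flags differ:     $flags_differ\n";
print "same string:      $same_string\n";
print "same length:      $same_length\n";
print "same code points: $same_ords\n";
```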
Re^2: UTF-8 strings and the bytes pragma by trizen (Hermit) on Jun 19, 2015 at 17:37 UTC
In my opinion, this may lead to some inconsistencies. For example, when:

my $s1 = "𝔘𝔫𝔦𝔠𝔬𝔡𝔢";
my $s2 = "\x{1D518}\x{1D52B}\x{1D526}\x{1D520}\x{1D52C}\x{1D521}\x{1D522}";

the bytes are the same:

240 157 148 152 240 157 148 171 240 157 148 166 240 157 148 160 240 157 148 172 240 157 148 161 240 157 148 162
240 157 148 152 240 157 148 171 240 157 148 166 240 157 148 160 240 157 148 172 240 157 148 161 240 157 148 162

I think it would be nice to have a way to automatically convert literal strings with hex escapes like `"\x{...}"` into UTF-8 strings.
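A possible explanation for why this second example behaves differently, sketched under the assumption that the only thing that matters is the size of the code points: a literal whose code points are all below 0x100 can be stored with UTF8=0, while a code point above 0xFF cannot, so Perl stores that string with UTF8=1. The variable names below are illustrative only.

```perl
use strict;
use warnings;

my $latin = "\xF8";         # code point below 0x100
my $wide  = "\x{1D518}";    # code point above 0xFF

my $latin_flag = utf8::is_utf8($latin) ? 1 : 0;
my $wide_flag  = utf8::is_utf8($wide)  ? 1 : 0;

print "latin stored as UTF8: $latin_flag\n";
print "wide stored as UTF8:  $wide_flag\n";

# Explicitly switching the storage format of the first string:
utf8::upgrade($latin);
my $latin_flag_after = utf8::is_utf8($latin) ? 1 : 0;
print "latin after upgrade:  $latin_flag_after\n";
```

This only inspects the internal flag for the sake of explanation; code that needs actual UTF-8 octets is probably better served by an explicit encode step.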
Re^3: UTF-8 strings and the bytes pragma by Anonymous Monk on Jun 19, 2015 at 17:54 UTC
"In my opinion, this may lead to some inconsistencies."

Here's a relatively recent discussion about it on the p5p mailing list: `http://www.nntp.perl.org/group/perl.perl5.porters/2015/01/msg224867.html`