Dear Monks,
I've stumbled on a strange behaviour with hash keys, which happens on every Perl version I could test, from 5.16 to 5.26.
It was asked some years ago on Stack Overflow, but nobody answered whether it is an optimization bug or expected behaviour.
The issue is that if you initialize a hash with a key containing non-ASCII characters (e.g. ones that fit in ISO-8859-1), the key is properly stored as UTF-8 (with the UTF8 flag on). But if you then assign a value to the hash element for that key, the key comes back downgraded (apparently stored as ISO-8859-1, with the UTF8 flag off). You can imagine the consequences if you then have to process this key, expecting it to be UTF-8 encoded…
Here's a script showing the issue:
#!/usr/bin/perl
use strict;
use warnings;
use utf8;
use Devel::Peek;
use Data::Dumper;
$Data::Dumper::Useqq = 1;

my %hash = (
    'clé' => 0,
);
my $key = (keys %hash)[0];
Dump($key);
print Dumper($key);

$hash{'clé'} = 1;
$key = (keys %hash)[0];
Dump($key);
print Dumper($key);

utf8::upgrade($key);
Dump($key);
print Dumper($key);
with the following output:
SV = PV(0x555ed17dfe60) at 0x555ed1809710
  REFCNT = 1
  FLAGS = (POK,pPOK,UTF8)
  PV = 0x555ed1993ed0 "cl\303\251"\0 [UTF8 "cl\x{e9}"]
  CUR = 4
  LEN = 5
$VAR1 = "cl\x{e9}";
SV = PV(0x555ed17dfe60) at 0x555ed1809710
  REFCNT = 1
  FLAGS = (POK,IsCOW,pPOK)
  PV = 0x555ed1909b10 "cl\351"
  CUR = 3
  LEN = 0
$VAR1 = "cl\351";
SV = PV(0x555ed17dfe60) at 0x555ed1809710
  REFCNT = 1
  FLAGS = (POK,pPOK,UTF8)
  PV = 0x555ed1825350 "cl\303\251"\0 [UTF8 "cl\x{e9}"]
  CUR = 4
  LEN = 10
$VAR1 = "cl\x{e9}";
As this code shows, the issue can be worked around by upgrading the key to UTF-8 with utf8::upgrade. But I would never have thought of doing that before stumbling on this issue, and I've never read anything in perldoc explaining this behaviour. Do you think it's expected for some reason?
Thanks!
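For anyone who wants to check whether the downgrade changes the key's value or only its internal storage, here is a minimal sketch under stated assumptions. It assumes the script file is saved as UTF-8 and that the standard Encode module is available. The variable names are illustrative only and are not part of the script above. It compares the key as a character string, encodes it explicitly to UTF-8 bytes rather than relying on the UTF8 flag, and checks that utf8::upgrade leaves the value unchanged.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use utf8;                  # string literals in this file are decoded as characters
use Encode qw(encode);

my %hash = ( 'clé' => 0 );
$hash{'clé'} = 1;          # the assignment after which the key came back downgraded

my $key = (keys %hash)[0];

# Compare as character strings: eq does not depend on the UTF8 flag.
my $same_chars = ( $key eq 'clé' ) ? 1 : 0;
my $char_len   = length $key;

# When UTF-8 bytes are needed, encode explicitly instead of trusting the flag.
my $bytes    = encode( 'UTF-8', $key );
my $byte_len = length $bytes;

# utf8::upgrade changes the internal representation only, not the value.
my $upgraded = $key;
utf8::upgrade($upgraded);
my $still_equal = ( $upgraded eq $key ) ? 1 : 0;
my $flag_on     = utf8::is_utf8($upgraded) ? 1 : 0;

printf "same_chars=%d char_len=%d byte_len=%d still_equal=%d flag_on=%d\n",
    $same_chars, $char_len, $byte_len, $still_equal, $flag_on;
```

If this behaves as I expect, the key has three characters whether or not the flag is set, and only the explicit encode produces the four UTF-8 bytes. That would suggest the flag matters mainly to code that inspects the raw buffer (XS modules, or output to a filehandle without an encoding layer) rather than to ordinary string operations.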
In reply to UTF8 hash key downgraded when assigned by gibus