Dear Monks

I've stumbled on a strange behaviour with hash keys. It happens on every Perl version I could test, from 5.16 to 5.26.

It was asked some years ago on Stack Overflow, but never got an answer on whether it is an optimization bug or expected behaviour.

The issue is this: if you initialize a hash with a key containing non-ASCII characters (e.g. ones that also exist in ISO-8859-1), the key is properly stored as UTF-8, with the UTF8 flag on. But if you then assign a value to the hash element for that key, the key comes back downgraded (probably encoded as ISO-8859-1, with the flag off). You can imagine the consequences if you have to do some processing on this key, expecting it to be UTF-8 encoded.

Here's a script showing the issue:

#!/usr/bin/perl
use strict;
use warnings;
use utf8;
use Devel::Peek;
use Data::Dumper;
$Data::Dumper::Useqq = 1;

my %hash = (
    'clé' => 0,
);
my $key = (keys %hash)[0];
Dump($key);
print Dumper($key);

$hash{'clé'} = 1;
$key = (keys %hash)[0];
Dump($key);
print Dumper($key);

utf8::upgrade($key);
Dump($key);
print Dumper($key);

with the following output:

SV = PV(0x555ed17dfe60) at 0x555ed1809710
  REFCNT = 1
  FLAGS = (POK,pPOK,UTF8)
  PV = 0x555ed1993ed0 "cl\303\251"\0 [UTF8 "cl\x{e9}"]
  CUR = 4
  LEN = 5
$VAR1 = "cl\x{e9}";
SV = PV(0x555ed17dfe60) at 0x555ed1809710
  REFCNT = 1
  FLAGS = (POK,IsCOW,pPOK)
  PV = 0x555ed1909b10 "cl\351"
  CUR = 3
  LEN = 0
$VAR1 = "cl\351";
SV = PV(0x555ed17dfe60) at 0x555ed1809710
  REFCNT = 1
  FLAGS = (POK,pPOK,UTF8)
  PV = 0x555ed1825350 "cl\303\251"\0 [UTF8 "cl\x{e9}"]
  CUR = 4
  LEN = 10
$VAR1 = "cl\x{e9}";

As this code shows, the issue can be worked around by upgrading the key to UTF-8. But I would never have thought I needed to do that before stumbling on this issue, and I've never read anything in perldoc explaining this behaviour. Do you think it's expected for some reason? Thanks!
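
For reference, here is a minimal sketch of how the same workaround could be applied to every key before processing. The helper name upgraded_keys is made up for illustration and does not come from any module; utf8::upgrade and utf8::is_utf8 are the core functions. It also checks that the key returned by keys and its upgraded copy compare equal as character strings, which is what I would expect, but treat that as my assumption rather than documented fact.

```perl
use strict;
use warnings;
use utf8;

# Hypothetical helper (illustration only): returns copies of the hash keys
# with the internal UTF8 flag switched on. The hash itself is not modified.
sub upgraded_keys {
    my ($hash) = @_;
    return map { my $k = $_; utf8::upgrade($k); $k } keys %$hash;
}

my %hash = ( 'clé' => 0 );
$hash{'clé'} = 1;    # the assignment after which the key came back downgraded

my ($raw) = keys %hash;              # key as returned by keys()
my ($up)  = upgraded_keys(\%hash);   # same key, upgraded copy

print utf8::is_utf8($up) ? "upgraded copy has the UTF8 flag\n" : "no UTF8 flag\n";
print $raw eq $up ? "same characters\n" : "different characters\n";
print "length: ", length($up), "\n";
```

This only changes the copies handed to later processing; the key stored inside the hash is left as Perl keeps it.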


In reply to UTF8 hash key downgraded when assigned by gibus
