http://qs1969.pair.com?node_id=11136919

mpersico has asked for the wisdom of the Perl Monks concerning the following question:

I am working with Inline::Python. In order to pass string data into Python, I am calling Data::Structure::Util::utf8_on() on all the data I am passing into any function call that is Python. It seems to work well except for the hash keys. Python complains: "TypeError: keys must be str, int, float, bool or None, not bytes at line 39". Turns out that in that module (sample here), only the values are set to utf8, not the keys.

Is there a way to set the keys to utf8 in XS? I was hoping for a HeKEY macro, but doing some poking around, I only see keys being retrieved, not set.

Failing that, would a simple:

my %newhash;
for (keys %oldhash) {
    $newhash{utf8_on($_)} = $oldhash{$_};
}
call_pyfunc(\%newhash);
seem to be the right thing to do?

Thank you.

Re: How to convert hash keys to utf8
by NERDVANA (Deacon) on Sep 22, 2021 at 06:34 UTC
    Calling utf8_on is almost never the right thing to do. Can you explain what problem this is solving for you?

    To elaborate a bit on the flag: it only tells perl’s string implementation whether to interpret the string’s internal buffer as one byte per character or as utf8. If you change the flag, you’re probably just breaking things unless you know for a fact that the bytes of the string are a valid utf8 sequence. If you want to take a string of latin1 characters and make sure they are represented as utf8 before handing those bytes to Python, you should call utf8::upgrade. If you read some utf8 from a file handle and want the string to understand that it contains characters and not bytes, you should call utf8::decode. Because perl uses utf8 internally, calling utf8::decode doesn’t actually change any bytes (the first time you call it); it is just like calling utf8_on except that it also verifies that the string contains valid utf8.
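As a rough illustration of the difference between utf8::upgrade and utf8::decode described above, here is a minimal sketch. The helper name describe and the sample strings are made up for this example; only the utf8:: and bytes:: functions come from core perl.

```perl
use strict;
use warnings;
use bytes ();    # load bytes::length without enabling the bytes pragma

# Hypothetical helper: report the character length, the size of perl's
# internal buffer in bytes, and whether the UTF8 flag is set.
sub describe {
    my ($s) = @_;
    return {
        chars => length($s),
        bytes => bytes::length($s),
        flag  => utf8::is_utf8($s) ? 1 : 0,
    };
}

# Case 1: one latin1 character (e-acute, U+00E9) stored as a single byte.
my $latin1   = "\xE9";
my $upgraded = $latin1;
utf8::upgrade($upgraded);    # same character, internal storage is now utf8

# Case 2: raw utf8 octets, as read from a file handle with no decoding layer.
my $octets  = "\xC3\xA9";
my $decoded = $octets;
utf8::decode($decoded);      # the two octets are now understood as one character

for my $pair ( [ latin1 => $latin1 ], [ upgraded => $upgraded ],
               [ octets => $octets ], [ decoded  => $decoded ] ) {
    my ( $name, $str ) = @$pair;
    my $d = describe($str);
    printf "%-8s chars=%d bytes=%d flag=%d\n",
        $name, $d->{chars}, $d->{bytes}, $d->{flag};
}
```

In this sketch, upgrade changes the internal bytes but not the character, while decode leaves the bytes alone and changes how many characters the string is considered to hold.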

    Edit: Whoops, I didn’t read that carefully enough. I was expecting that the module you linked was recursively setting the utf8 flag in the manner of SvUTF8_on. Actually it does call upgrade, and the author just chose the name poorly.

    It sounds sort of like you are saying that Inline::Python refuses to serialize string data unless it has the utf8 flag set on the string. This sounds like a bug in Inline::Python. The correct serialization pattern would be to upgrade the string as it was getting serialized, preferably only on the bytes being moved and without altering the original SV.

    For your code example, it appears the only way to upgrade hash keys (in pure perl) is to rebuild the hash:

    for (keys %h) { utf8::upgrade($_); $h{$_}= $h{$_} }
    Hash keys are not SV instances, and the way utf8 is indicated to hv_store is with a negative key length. When you consider that any string containing a byte above 0x7f would need to be re-hashed… I think the only way is rebuilding the hash.
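A slightly more explicit version of the rebuild idea might look like the sketch below. The helper name upgrade_keys is hypothetical, and deleting each entry before storing it again is a choice made for this example rather than something the reply above requires.

```perl
use strict;
use warnings;

# Hypothetical helper: re-store every entry of a hash under an upgraded
# copy of its key. Works in place and returns the same hash reference.
sub upgrade_keys {
    my ($h) = @_;
    # keys() builds the full list up front, so deleting inside the loop is safe.
    for my $key ( keys %$h ) {
        my $value = delete $h->{$key};    # remove the old entry first
        utf8::upgrade($key);              # $key is a copy of the stored key
        $h->{$key} = $value;              # store again under the upgraded copy
    }
    return $h;
}

my %h = ( alpha => 1, "caf\xE9" => 2 );
upgrade_keys( \%h );

# Print each key together with the flag state perl reports for it.
for my $key ( sort keys %h ) {
    printf "%s flagged=%d value=%d\n",
        $key, ( utf8::is_utf8($key) ? 1 : 0 ), $h{$key};
}
```

Whether the keys come back flagged afterwards is worth confirming with utf8::is_utf8 on the perl version in use; the number of entries and their values should be unchanged either way.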
Re: How to convert hash keys to utf8
by hexcoder (Curate) on Sep 23, 2021 at 09:35 UTC
    I interpret your question as "how do i change the hash key in a hash?"

    If you can change the original hash, I suggest this code snippet, which deletes entries with the non-utf8 keys and reuses the values for the utf8-version key.

    for (keys %oldhash) {
        # change key and keep the value
        $oldhash{utf8_on($_)} = delete $oldhash{$_};
    }
    call_pyfunc(\%oldhash);
      Neither _utf8_on (in Encode.pm) nor utf8_on (in DSU) mentions a return value; _utf8_on (in DSU, probably different from Encode's) explicitly says:
      "The data structure is converted in-place and as a convenience the passed variable is returned from the function."
      Given mpersico's observation, it looks like for the hash itself it is not relevant whether its keys have a utf8 flag set or not.
      Interesting point. That's not what I did though. My solution was
      $ref->{ utf8_on($key) } = $ref->{$key};
      and yet, I did not double the size of the hash, as proven by tests I have written to count and enumerate the keys as received in Python from Perl. I believe that my code "works" without duplicating the keys because utf8_on is not changing or manipulating the actual string that is the key; all it is doing is manipulating the metadata.
Re: How to convert hash keys to utf8
by mpersico (Monk) on Sep 23, 2021 at 18:06 UTC
    In the end, I wrote this:
    use Data::Structure::Util qw(get_refs utf8_on);

    sub _perl_to_python_utf8_on {
        # We need to convert the hash keys ourselves.
        # 'get_refs' digs through the data structure
        # and returns an array ref of every reference
        # in the structure, at all levels.
        # All Hail CPAN; so glad I didn't have to walk
        # the data myself.
        my $refs = get_refs( $_[0] );

        # For each ref, process just the hashes.
        for my $ref ( @{$refs} ) {
            if ( ref($ref) eq 'HASH' ) {
                for my $key ( keys %$ref ) {
                    # See text below.
                    $ref->{ utf8_on($key) } = $ref->{$key};
                }
            }
        }

        # Let the utility convert everything else.
        return utf8_on( $_[0] );
    }
    You might think that the statement $ref->{ utf8_on($key) } = $ref->{$key}; would produce duplicate keys, but it does not; I went back and wrote key counting tests to prove it after reading other comments in this thread. The key (pun intended) to understanding what's going on here is that I believe utf8_on() only twiddles metadata; it does not change the actual value of the key. Hence we don't get duplicate entries and do not need to do
    $ref->{ utf8_on($key) } = delete $ref->{$key}.
    You'd also think that I've now mismatched the key string and its metadata, but Inline::Python seems to be Doing the Right Thing with it, so I am leaving well enough alone for now. Maybe the utf8 'value' of each char is the same as its ascii value when less than 128 or 255?
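On that last question, a small experiment can show what happens below and above 128. This is only a sketch: the helper names internal_sizes and keys_after_restore are invented here, and utf8::upgrade stands in for whatever utf8_on does internally, which is an assumption.

```perl
use strict;
use warnings;
use bytes ();    # load bytes::length without enabling the bytes pragma

# Hypothetical helper: size of perl's internal buffer before and after
# utf8::upgrade, for a copy of the given string.
sub internal_sizes {
    my ($s) = @_;
    my $before = bytes::length($s);
    utf8::upgrade($s);
    return ( $before, bytes::length($s) );
}

# Hypothetical helper: store a value under a key, store it again under an
# upgraded copy of that key, and return how many keys the hash ends up with.
sub keys_after_restore {
    my ($key) = @_;
    my %h = ( $key => 'value' );
    my $copy = $key;
    utf8::upgrade($copy);
    $h{$copy} = $h{$key};
    return scalar keys %h;
}

for my $sample ( "abc", "caf\xE9" ) {
    my ( $before, $after ) = internal_sizes($sample);
    printf "length=%d internal bytes before=%d after=%d keys=%d\n",
        length($sample), $before, $after, keys_after_restore($sample);
}
```

Characters below 128 occupy the same single byte either way, so a pure ASCII key is byte-for-byte identical after upgrading. Characters from 128 to 255 take two bytes once upgraded, yet perl still compares the strings as equal character by character, which would explain why no duplicate keys appear.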