in reply to Wide characters and UTF8

The documentation for decode_json says it takes a UTF8 encoded string.

This is the doc:

decode_json
$perl_scalar = decode_json $json_text

The opposite of encode_json: expects an UTF-8 (binary) string and tries to parse that as an UTF-8 encoded JSON text, returning the resulting reference. Croaks on error.

(Emphasis mine)

decode_json expects BYTES, not UTF-8 CHARACTERS. Feed it the non-decoded file (i.e. open raw, not with :encoding) and everything should work.
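
A minimal sketch of the difference, using in-memory strings rather than a file. It assumes JSON::PP (a core module) as a stand-in for whichever JSON backend is installed; the sample data and variable names are made up for illustration.

```perl
use strict;
use warnings;
use JSON::PP qw(decode_json);   # assumption: core JSON::PP standing in for JSON

# UTF-8 BYTES for the JSON text {"smile":"☺"} (U+263A is E2 98 BA in UTF-8)
my $bytes = qq({"smile":"\xE2\x98\xBA"});

# The same JSON text as CHARACTERS: one wide character instead of three bytes
my $chars = qq({"smile":"\x{263A}"});

# Bytes in: decode_json does the UTF-8 decoding and the JSON parsing
my $data = decode_json($bytes);
my $ok   = $data->{smile} eq "\x{263A}";

# Characters in: the input is not a byte string, so decode_json croaks
my $croaked = !eval { decode_json($chars); 1 };

print "bytes parsed correctly: ", ($ok      ? "yes" : "no"), "\n";
print "characters rejected:    ", ($croaked ? "yes" : "no"), "\n";
```

The eval is only there so the script can report the failure instead of dying on it.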

Alexander

--
Today I will gladly share my knowledge and experience, for there are no sweeter words than "I told you so". ;-)

Re^2: Wide characters and UTF8
by Bod (Parson) on Nov 08, 2023 at 17:50 UTC

    Thanks - that has got me much further...

    So I am clear, is the :encoding in open telling Perl how the file is currently encoded or is it instructing Perl to encode the data?

      The important thing to know about Perl's Unicode support is that Perl does not track whether a scalar holds bytes or characters. You, the programmer, need to keep track of whether you have a string of bytes or a string of Unicode characters. The easiest way to do this is to always decode bytes (such as UTF-8 or UTF-16) into characters the moment they enter the program, as your ":encoding(UTF-8)" layer does.

      As it happens, the decode_json function expects bytes as input, on the assumption that you haven't done the decoding yet, and it then decodes the UTF-8 and parses the JSON in one step. On the other hand, JSON->new->decode($string) assumes you have given it a string of Unicode characters.

      So in summary:

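      open my $fh, '<:raw', $filename or die "Cannot open $filename: $!"; my $bytes = do { local $/; <$fh> }; my $data = decode_json($bytes);
      or
      open my $fh, '<:encoding(UTF-8)', $filename or die "Cannot open $filename: $!"; my $chars = do { local $/; <$fh> }; my $data = JSON->new->decode($chars);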

      The :encoding tells Perl what encoding the data in the file is in, and Perl will then decode the data and give you Unicode strings.
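
      For anyone who wants to try both routes side by side, here is a rough, self-contained sketch. It assumes JSON::PP (core) in place of JSON, writes its own temporary sample file, and uses a hypothetical slurp helper that is not part of either module.

      ```perl
      use strict;
      use warnings;
      use JSON::PP;                   # assumption: core JSON::PP standing in for JSON
      use File::Temp qw(tempfile);

      # Write a small UTF-8 encoded JSON file to read back (made-up sample data)
      my ($out, $filename) = tempfile(UNLINK => 1);
      binmode $out, ':raw';
      print {$out} qq({"name":"Zo\xC3\xAB"});   # "Zoë" as UTF-8 bytes
      close $out or die "close failed: $!";

      # Hypothetical helper: read a whole file through the given open mode
      sub slurp {
          my ($mode, $path) = @_;
          open my $fh, $mode, $path or die "Cannot open $path: $!";
          local $/;                   # no record separator: read everything at once
          return scalar <$fh>;
      }

      # Route 1: raw bytes in, decode_json handles the UTF-8 decoding
      my $from_bytes = decode_json(slurp('<:raw', $filename));

      # Route 2: the :encoding layer decodes, so the parser gets characters
      my $from_chars = JSON::PP->new->decode(slurp('<:encoding(UTF-8)', $filename));

      my $same = $from_bytes->{name} eq $from_chars->{name};
      my $len  = length $from_bytes->{name};
      print "same result: ", ($same ? "yes" : "no"), ", length: $len\n";
      ```

      Either way you end up with the same character string in the data structure; the only thing that changes is which layer does the UTF-8 decoding.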